I'm looking for references to a specific kind of text algorithm -- the algorithm should generate a number (say, 32 or 64 bits) for any text string of any length, similar to a hash. However, it should be possible to compare the numbers for different strings to tell how close they are to each other. For example, the numbers for

1. To be or not to be.
2. Two bees or not two bees.
3. I don't know whether to be or not to be.

should indicate that the three strings are relatively close to each other (while a hash number would give no indication at all).

I'm asking only out of interest, because I came up with a simple algorithm to do this while I was in the shower yesterday, and it would be fun to see how close it is to what the pros use for spam detection and so on.

Note that I'm not looking for algorithms based on edit-distance, bag-of-words, and so on.

Thanks in advance,

David

--
David Megginson, david@m..., http://www.megginson.com/
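
For illustration only, here is a minimal sketch of one well-known technique that fits the description above: a similarity-preserving fingerprint commonly called SimHash. It is not the algorithm mentioned in the post, which is not described there. The choices of character trigrams as features, BLAKE2b as the per-feature hash, and the names `shingles`, `simhash`, and `hamming` are all assumptions made for this example. Because it votes over character n-grams, it may or may not count as the kind of bag-of-words approach the post excludes.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of character n-grams of the normalized text.

    Normalization (lowercasing, collapsing whitespace) is an assumption
    of this sketch, not something the post specifies.
    """
    norm = " ".join(text.lower().split())
    if len(norm) <= n:
        return {norm}
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def simhash(text, bits=64, n=3):
    """Compute a SimHash-style fingerprint of at most 64 bits.

    Each n-gram is hashed to 64 bits; every bit position receives a +1
    vote if the n-gram's hash has that bit set and a -1 vote otherwise.
    The fingerprint has a 1 wherever the votes sum to a positive number.
    """
    if not 1 <= bits <= 64:
        raise ValueError("bits must be between 1 and 64")
    votes = [0] * bits
    for gram in shingles(text, n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    samples = [
        "To be or not to be.",
        "Two bees or not two bees.",
        "I don't know whether to be or not to be.",
    ]
    prints = [simhash(s) for s in samples]
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            print(i + 1, j + 1, hamming(prints[i], prints[j]))
```

Two fingerprints are compared by Hamming distance: the fewer bits that differ, the more trigrams the two strings are likely to share. For strings as short as the three examples, the estimate is noisy, so the distances should be read as a rough indication rather than a precise measure.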