[Home] [By Thread] [By Date] [Recent Entries]
This is an interesting problem. One obvious yet pretty much useless algorithm is the length() function and its variations: word count, unique word count, etc. It works similar to the genome size: measure of information entropy potential. All the song lyrics will be pretty close to each other and so will novels. While useless, it can handle all the languages. <g> Don Park Docuverse > -----Original Message----- > From: David Megginson [mailto:david@m...] > Sent: Sunday, March 09, 2003 6:18 AM > To: xml-dev@l... > Subject: Re: [OT] Looking for a text algorithm > > Miles Sabin writes: > > > Judging from your examples it looks like you're after a closeness > > criterion derived from longest common subsequences. But I don't see how > > you could use that to usefully construct a single characteristic number > > for _any_ string of _any_ length: with only 32 or 64 bits to play with, > > many many completely unrelated (on any criterion) strings will collide > > on the same code. > > Quite right. However, if the alternative is a linear search (say, > using an edit-distance algorithm), then reducing the number of > candidates by a few orders of magnitude would not necessarily be a bad > thing. > > The problem I was considering (in the shower) was detecting spam > messages with minor variations, such as the insertion of the > recipient's e-mail address in the body or the substitution of Zimbabwe > for Nigeria. Assume that I have a database containing many millions > of known spam messages, and that I want to check an incoming e-mail > message against it. If I can narrow the field down to, say, 50 > candidates after a very inexpensive operation, then my system will be > much more efficient; I can then use edit-distance against the closest > matches to see if the message really is likely spam. > > That said, based on private e-mail from another list member, I suspect > that there may be nothing original about the algorithm I came up with; > nevertheless, here it is for anyone who would care to take a peek: > > http://www.megginson.com/private/megginson-index-00.zip > > > All the best, > > > David > > -- > David Megginson, david@m..., http://www.megginson.com/ > > ----------------------------------------------------------------- > The xml-dev list is sponsored by XML.org <http://www.xml.org>, an > initiative of OASIS <http://www.oasis-open.org> > > The list archives are at http://lists.xml.org/archives/xml-dev/ > > To subscribe or unsubscribe from this list use the subscription > manager: <http://lists.xml.org/ob/adm.pl>
|

Cart



