md5sum is a cryptographic hash using the MD5 algorithm. It's not fast, but it will do what you want. It's available on Linux, in Cygwin, and probably other ways. In a reasonable command shell, where Unix commands are available along with md5sum, `md5sum *.xml | sort` will put the duplicate files on neighboring lines.

Jeff

----- Original Message -----
From: "Eric Hanson" <eric@a...>
To: <xml-dev@l...>
Sent: Thursday, April 29, 2004 12:58 PM
Subject: hashing

> I have a large collection of XML documents, and want to find and
> group any duplicates. The obvious but slow way of doing this is
> to just compare them all to each other. Is there a better
> approach?
>
> Particularly, are there any APIs or standards for "hashing" a
> document so that duplicates could be identified in a similar way
> to what you'd do with a hash table?
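As a sketch of the pipeline above: `sort` alone puts matching digests on adjacent lines, and GNU `uniq` with `-w32` (compare only the first 32 characters, i.e. the MD5 digest) plus `--all-repeated` can filter the output down to just the duplicate groups. The file names here are invented for illustration; the technique assumes GNU coreutils.

```shell
# Work in a scratch directory with sample files (names are illustrative).
cd "$(mktemp -d)"
printf '<a/>' > one.xml
printf '<a/>' > two.xml      # identical content to one.xml
printf '<b/>' > three.xml    # unique content

# Hash every file, sort so identical digests become neighbors, then keep
# only lines whose first 32 characters (the MD5 digest) occur more than
# once; --all-repeated=separate puts a blank line between duplicate groups.
md5sum *.xml | sort | uniq -w32 --all-repeated=separate
```

This prints only `one.xml` and `two.xml` (sharing one digest) and omits `three.xml`. Note that a hash groups byte-identical files; two XML documents that are logically equivalent but differ in whitespace or attribute order will hash differently unless they are canonicalized first.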