On Friday, Governor George W. Bush of Texas posted complete records of his campaign contributions on his web site. However, he deliberately posted them in PDF format so they couldn't be imported into a database or a spreadsheet, and consequently reporters and voters couldn't find out just how much of his money was coming from whom. Or at least that's what he thought. :-)

I am pleased to announce that after a few hours of intense hacking I have succeeded in extracting the crucial information from the PDF files and have posted it online in XML and tab-delimited formats for anybody who wants it. Accountants, start your spreadsheets! You'll find the files at

http://metalab.unc.edu/javafaq/bush/

I've written a very simple DTD for the XML version.

<http://metalab.unc.edu/javafaq/bush/donations.dtd>

The results do appear to be well-formed and, checked against this DTD, valid (though I've been burned by misbehaving validators before). The first two validators I tried gave up on trying to parse such a large (more than eight megabytes) document.

Interestingly, the initial conversion to XML did turn up some bugs in my PDF-to-text converter program, but the validation of the XML did not find any additional problems. I can see where a schema language would be very useful for this sort of reverse engineering work though.

Eventually I may try to cook up a more serious DTD that more closely matches the FEC's actual required format for filing electronic copies of donor lists. I'm also going to try to add a simple XSL stylesheet to these files in the near future, but they're so large that they really challenge anyone trying to browse them directly.

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@m...            | Writer/Programmer |
+-----------------------+------------------------+-------------------+
| The XML Bible (IDG Books, 1999)                                    |
| http://metalab.unc.edu/xml/books/bible/                            |
| http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/     |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java News: http://metalab.unc.edu/javafaq/   |
| Read Cafe con Leche for XML News: http://metalab.unc.edu/xml/      |
+----------------------------------+---------------------------------+

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)



