Re: Structured from/within unstructured documents

From: Jonathan Robie <jonathan.robie@r...>
To: Stephen Green <stephengreenubl@g...>
Date: Sat, 15 Dec 2007 14:31:31 -0500

Stephen Green wrote:
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> [!!! SNIP !!!]
>
> Is this all there is?
>   

Microsoft Word and OpenOffice.org both export to XML, and Antiword is a 
program that does a pretty good job of converting Word files to DocBook.
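
As a rough illustration of the Antiword route, here is a minimal Python sketch that shells out to antiword and saves its DocBook output. It assumes antiword is installed and on the PATH, and that its "-x db" option selects DocBook XML output; the function names and the file names in the usage note are hypothetical and only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def antiword_command(doc_path):
    """Build the antiword invocation that writes DocBook XML to stdout."""
    # "-x db" asks antiword for DocBook XML rather than plain text.
    return ["antiword", "-x", "db", str(doc_path)]


def word_to_docbook(doc_path, out_path):
    """Convert a Word .doc file to DocBook XML (needs antiword on PATH)."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    # check=True raises CalledProcessError if antiword exits with an error.
    result = subprocess.run(
        antiword_command(doc_path),
        check=True,
        capture_output=True,
    )
    out = Path(out_path)
    out.write_bytes(result.stdout)
    return out
```

Something like word_to_docbook("report.doc", "report.xml") would then leave the DocBook markup in report.xml. How good the result is depends on how the Word file was styled, so treat the output as a starting point for cleanup rather than a finished document.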

For PDF, though, I don't know of any really good tools. The following 
page, from someone who has played with the problem, gives a summary of 
what's out there:

http://discerning.com/hacks/docutils/pdf2xml/readme.html

I'd love it if someone would tell me there's something actively 
maintained in the open source world that does this job. I don't know of 
one yet.

Jonathan
Red Hat Enterprise MRG: http://www.redhat.com/mrg/

Follow-Ups:
- Re: Structured from/within unstructured documents
  - From: "Stephen Green" <stephengreenubl@g...>

References:
- Structured from/within unstructured documents
  - From: "Stephen Green" <stephengreenubl@g...>
