Re: Structured from/within unstructured documents

From: Marcus Carr <mcarr@a...>
To: xml-dev@l...
Date: Tue, 18 Dec 2007 10:17:31 +1100

Stephen Green wrote:

> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?

Maybe I'm missing something, but I didn't see anyone suggest saving the 
PDF as XML straight from Acrobat. If you have a full licence, it does a 
pretty respectable job, getting you paragraph and character tagging, 
tables and images. You can also batch process, converting entire 
directories or what have you. The results are at least as good as saving 
the PDF to something like Word first and you could be forgiven for 
expecting that they might even be better.

Once you're that far, you can get on your XSLT boots...
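As a rough illustration of that post-processing step, here is a minimal sketch in Python (standing in for XSLT, using only the standard library). The element names (TaggedPDF-doc, Sect, P, Span, Table, TR, TH, TD) and the sample content are assumptions made for the example, not a guaranteed description of what Acrobat emits; check them against your own export before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for an Acrobat XML export.
# The tag names and the data are assumptions for illustration only.
SAMPLE = """<TaggedPDF-doc>
  <Sect>
    <H1>Invoice</H1>
    <P>Payment is due in <Span>30 days</Span>.</P>
    <Table>
      <TR><TH>Item</TH><TH>Price</TH></TR>
      <TR><TD>Widget</TD><TD>9.50</TD></TR>
      <TR><TD>Gadget</TD><TD>12.00</TD></TR>
    </Table>
  </Sect>
</TaggedPDF-doc>"""


def flat_text(elem):
    """Return all text under elem, including inline children, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def extract_paragraphs(root):
    """Collect the text of every P element in document order."""
    return [flat_text(p) for p in root.iter("P")]


def extract_tables(root):
    """Turn each Table into a list of dicts keyed by its first row."""
    tables = []
    for table in root.iter("Table"):
        rows = [
            [flat_text(cell) for cell in tr if cell.tag in ("TH", "TD")]
            for tr in table.iter("TR")
        ]
        if not rows:
            continue
        # Assumes the first row holds the column headings.
        header, *body = rows
        tables.append([dict(zip(header, row)) for row in body])
    return tables


root = ET.fromstring(SAMPLE)
paragraphs = extract_paragraphs(root)
tables = extract_tables(root)
print(paragraphs)
print(tables)
```

The same selection logic maps directly onto XSLT templates matching P and Table/TR; the point is only that once the PDF is in XML, the structured data is a short query away.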

Marcus

Follow-Ups:
- Re: Re: Structured from/within unstructured documents
  - From: "Edward C. Zimmermann" <edz@b...>
- Re: Re: Structured from/within unstructured documents
  - From: "Stephen Green" <stephengreenubl@g...>
