Ghostscript includes a ps2ascii utility to extract text: it does a reasonable but not 100% accurate job (and includes the full Ghostscript PostScript interpreter). If you turn off the ps2ascii simple mode (remove the "-dSIMPLE" argument), Ghostscript outputs font and positioning information for each string. You can use that information to eliminate headers and footers, identify elements to tag, and so forth.

Exegenix (http://exegenix.com/) has a commercial solution for converting PostScript or PDF to XML; it looks intriguing.

--
Larry Kollar k o l l a r @ a l l t e l . n e t
"The hardest part of all this is the part that requires thinking."
-- Paul Tyson, on xml-doc

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
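
As a rough illustration of the header and footer filtering mentioned in the message, here is a minimal Python sketch. It assumes each string in the non-simple ps2ascii output arrives as a record of the form "S <x> <y> (<text>) <width>"; that record shape, the function name body_strings, and the two y thresholds are assumptions made for this example, so check them against the output your Ghostscript version actually produces.

```python
import re

# ASSUMPTION: one string per line, shaped like "S <x> <y> (<text>) <width>".
# Verify this against real output before relying on it.
STRING_RECORD = re.compile(r"^S\s+(-?\d+)\s+(-?\d+)\s+\((.*)\)\s+(-?\d+)\s*$")


def body_strings(lines, footer_max_y, header_min_y):
    """Return (x, y, text) for strings whose y lies strictly between the margins.

    footer_max_y and header_min_y are hypothetical thresholds, in whatever
    units the positioning output uses; pick them by inspecting a sample page.
    """
    kept = []
    for line in lines:
        match = STRING_RECORD.match(line)
        if match is None:
            continue  # font changes, page breaks, or anything unrecognised
        x, y = int(match.group(1)), int(match.group(2))
        if footer_max_y < y < header_min_y:
            kept.append((x, y, match.group(3)))
    return kept


# Made-up sample records, for illustration only.
sample = [
    "F 100 1 (Times-Roman)",
    "S 720 7500 (Chapter 1) 400",
    "S 720 6000 (Body text here.) 900",
    "S 3000 300 (Page 12) 300",
    "P",
]

print(body_strings(sample, footer_max_y=500, header_min_y=7200))
```

The same loop could also track the most recent font record to decide which strings to tag as headings, but that depends on details of the font lines that are not covered here.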



