Any instance of an LR(1) language can be processed in pure XSLT, and one possible result is an XML document.

See, for example, the json-document() function of FXSL. This function uses FXSL's generic LR(1) parsing system: the lr-parse() function.

More information can be found here:

http://dnovatchev.spaces.live.com/Blog/cns!44B0A32C2CCF7488!367.entry
http://www.stylusstudio.com/xsllist/200711/post20640.html

Cheers,
Dimitre Novatchev

"Stephen Green" <stephengreenubl@g...> wrote in message news:92040e120712151004n13dec762x770cbe02afa1abb8@m...
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> I'm aware it is quite straightforward to extract data from semi-structured
> documents such as spreadsheets (as previous XML-Dev discussions have
> shown, such as via ODF with XSLT and macros/Ant/Ant Contrib, etc).
>
> As yet, the only way I'm aware of for doing the same from PDF would be to
> print out to paper and use OCR (sounds a little ridiculous) or maybe to
> convert PDF, etc. to some XML-based or other text-based print/archive
> file somehow and go from there (perhaps with something akin to a
> screen-scraper?).
>
> Is this all there is?
>
> Plus, how does one then convert the data into, say, some XML
> or equivalent document and embed that in, say, the PDF or equivalent
> unstructured document file (for later extraction, say)?
>
> I'd very much appreciate any light on this. Thank you. I'm interested not
> so much in metadata but in actual data or full structured equivalents of the
> unstructured documents, rather than just enough data to create an index.
>
> E.g., what about patient records held in PDF and in XML formats, and how
> to turn the first into the latter and/or embed the latter in the first?
>
> Best regards
>
> --
> Stephen Green
>
> Partner
> SystML, http://www.systml.co.uk
> Tel: +44 (0) 117 9541606
>
> http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
>
> _______________________________________________________________________
>
> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
> to support XML implementation and development. To minimize
> spam in the archives, you must subscribe before posting.
>
> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
> Or unsubscribe: xml-dev-unsubscribe@l...
> subscribe: xml-dev-subscribe@l...
> List archive: http://lists.xml.org/archives/xml-dev/
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
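
The reply at the top points to FXSL's json-document() as an example of parsing a non-XML text format and producing an XML document from it. To make that end result concrete, here is a rough Python sketch of the same idea. It is not FXSL, it is not XSLT, and it does not use an LR(1) table: it leans on Python's built-in json parser instead. The element and attribute names (value, member, item, type, name) are assumptions made up for illustration and are not claimed to match what json-document() emits.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(value, tag="value"):
    """Map a parsed JSON value to an XML element (illustrative vocabulary only)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        elem.set("type", "object")
        for key, child in value.items():
            member = to_xml(child, "member")
            member.set("name", key)  # keep the JSON key as an attribute
            elem.append(member)
    elif isinstance(value, list):
        elem.set("type", "array")
        for child in value:
            elem.append(to_xml(child, "item"))
    elif value is None:
        elem.set("type", "null")
    elif isinstance(value, bool):  # test bool before int: bool is a subclass of int
        elem.set("type", "boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem.set("type", "number")
        elem.text = repr(value)
    else:
        elem.set("type", "string")
        elem.text = value
    return elem


def json_to_xml(text):
    """Parse JSON text and serialize the resulting XML tree to a string."""
    return ET.tostring(to_xml(json.loads(text)), encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record, loosely echoing the patient-records question.
    print(json_to_xml('{"patient": {"id": 7, "tags": ["a", "b"]}}'))
```

The point of the sketch is only the shape of the pipeline: once the input has been parsed by some grammar-driven parser, walking the parse result and emitting elements is the easy part. Getting reliable text out of a PDF in the first place is a separate problem that this does not address.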