Without wishing in any way to vitiate John Cowan's plea for attendees at XML2002, I should like to announce that it is possible to see John for free at the next meeting of the XML SIG, Tuesday 12 November, 7 - 9 p.m., at 125 Broad Street in downtown Manhattan. John will be unveiling his project "TagSoup, A SAX Parser For Nasty, Ugly HTML" (abstract below).

It is, however, necessary to register in advance in order to reserve a place for this presentation. You may do that by emailing a request to me at mailto:wperry@x.... You will receive a confirmation by return email.

XML SIG Presentation
12 November 2002
John Cowan: "TagSoup, A SAX Parser For Nasty, Ugly HTML"

John writes:

For the last year I have been working on a new parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup is now very close to being ready for its first public Open Source release under the Academic Free License, a cleaned-up and patent-safe BSD-style license which allows proprietary re-use.

TagSoup is a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted whenever possible: text like:

    This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

    This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text

By intention, TagSoup is small and fast. After release, I will spend some time making it faster if it turns out to be too slow. It does not depend on the existence of any framework other than SAX, and should be able to work with any framework that can accept SAX parsers.

If your tag soup is not HTML, TagSoup can use a custom schema (written in Tag Soup Schema Language, a subset of RELAX NG compact syntax) instead of using the default HTML schema. You can also replace the low-level HTML scanner with one based on Sean McGrath's PYX format (very close to James Clark's ESIS format). You can also supply an AutoDetector that peeks at the incoming byte stream and guesses a character encoding for it. (Otherwise, the platform default is used. If someone supplies a good AutoDetector I may package it with later releases.)

The presentation will focus on practical results: you will learn how to use TagSoup in its simple HTML mode, and get an idea of which features can be customized and how.
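
For readers who want to see what "restarting" overlapping tags involves, the following is a minimal Java sketch of the idea, not TagSoup's actual code. It assumes bare tags with no attributes, and the class name RestartSketch and method renest are hypothetical names chosen for illustration. It keeps a stack of open elements; when an end tag arrives for an element that is not on top of the stack, it closes the elements above it, closes the named element, and then reopens the ones it was forced to close.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: a hypothetical helper, not part of TagSoup.
final class RestartSketch {
    // Matches a bare start or end tag such as <B> or </i>; attributes are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static String renest(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // innermost open element is on top
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start());     // copy the text before this tag
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {
                // Start tag: remember it and emit it in lower case.
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // End tag: close everything nested inside the named element first.
                List<String> restart = new ArrayList<>();
                while (!open.peek().equals(name)) {
                    String inner = open.pop();
                    out.append("</").append(inner).append('>');
                    restart.add(0, inner);         // keep outermost-first order
                }
                open.pop();
                out.append("</").append(name).append('>');
                // Reopen the elements that were closed only to keep nesting legal.
                for (String r : restart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // An end tag with no matching open element is silently dropped.
        }
        out.append(input.substring(pos));          // copy any trailing text
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(renest(
            "This is <B>bold, <I>bold italic, </b>italic, </i>normal text"));
    }
}
```

Running main on the example from the abstract prints the rewritten line quoted there. The sketch works on strings purely to keep it short; TagSoup itself reports the repaired structure to the application as SAX events and produces no rewritten text.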



