Re: generating DOM from ill-formed HTML docs

To: xml-dev@l...
Subject: Re: generating DOM from ill-formed HTML docs
From: "Thomas B. Passin" <tpassin@c...>
Date: Sun, 14 Jul 2002 23:20:07 -0400
References: <20020715020006.44124.qmail@w...>

[Robert Mena]

> Hi, I am developing an application that will have to
> build a DOM tree of html pages.
>
> I'll use such DOM trees to perform some
> analysis/comparisons.
>
> Since most of the time I'll find ill-formed documents
> I'd like to know if there are any parsers out there
> that "accept" these flaws and build the tree anyway.
>
> I've tried domxml (php) with no luck.

The usual answer is to preprocess with Tidy - see

http://www.w3.org/People/Raggett/tidy/

You may also want to look at NekoHTML, at

http://www.apache.org/~andyc/

It processes HTML, fixing up some problems along the way, and uses the
Xerces Native Interface (XNI), so you can build a DOM.
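
To make the "tolerant parse, then build a DOM" idea concrete, here is a rough
sketch in Python. It is neither Tidy nor NekoHTML: it uses only the standard
library's html.parser and xml.dom.minidom as a stand-in, and the names
LenientDOMBuilder, build_dom, VOID and the synthetic "document" wrapper
element are my own assumptions for illustration. It only recovers from
unclosed and stray tags; it does not apply HTML's implicit-close rules (an
unclosed <p> followed by another <p> ends up nested), so treat it as an
outline of the approach rather than a replacement for either tool.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that never take a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class LenientDOMBuilder(HTMLParser):
    """Hypothetical helper: builds a minidom tree from ill-formed HTML."""

    def __init__(self):
        super().__init__()
        self.doc = Document()
        # Assumption: wrap everything in a synthetic root, because a DOM
        # Document allows one root element and tag soup may have several.
        root = self.doc.createElement("document")
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes (e.g. "checked") arrive with value None.
            element.setAttribute(name, value if value is not None else "")
        self.stack[-1].appendChild(element)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Find the nearest open element with this name and close it along
        # with anything left open inside it. Index 0 (the root) never closes.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return
        # No matching open element: a stray closing tag, silently ignored.

    def handle_data(self, data):
        # Whitespace-only text is dropped to keep the tree small.
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def build_dom(html):
    """Parse possibly ill-formed HTML and return a minidom Document."""
    builder = LenientDOMBuilder()
    builder.feed(html)
    builder.close()
    return builder.doc


# Unclosed <p> and <span>, plus a stray </b>, still yield a tree.
doc = build_dom("<div><p>one<br>two</b></div><span>tail")
print(doc.documentElement.toxml())
```

The resulting Document can be walked with the usual DOM calls (childNodes,
getElementsByTagName and so on) for the kind of analysis or comparison
described in the question.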

Cheers,

Tom P
