Subject: RE: How to Handle Bad XML (or Word HTML)
From: "Joshua Allen" <joshuaa@xxxxxxxxxxxxx>
Date: Tue, 11 Mar 2003 13:57:24 -0800
The best bet is to use HTML Tidy to tidy it up:
http://tidy.sourceforge.net
Tidy even has a mode specifically for MS Word.
Also note that Word in Office 11 (currently in Beta 2) supports
round-tripping of documents as well-formed XML.
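
As a rough sketch of how that might look in practice (not a definitive
recipe): the Python below first shows why the XSLT processor rejects the
Word output, then builds a Tidy command line for cleaning it. It assumes
the `tidy` executable is on your PATH. The flags used are Tidy's
`-asxhtml`, `-numeric`, `-quiet`, and the `word-2000` config option. The
function names and the file name in the usage note are made up for
illustration.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def build_tidy_command(source_path: str) -> list[str]:
    """Build a Tidy command line (hypothetical helper, flags are Tidy's own)."""
    return [
        "tidy",
        "-quiet",             # suppress non-essential output
        "-asxhtml",           # emit well-formed XHTML
        "-numeric",           # write numeric rather than named entities
        "--word-2000", "yes", # strip Word-specific markup
        source_path,
    ]


def tidy_word_html(source_path: str) -> str:
    """Run Tidy on a Word HTML file and return the cleaned markup."""
    if shutil.which("tidy") is None:
        raise RuntimeError("the tidy executable was not found on PATH")
    result = subprocess.run(
        build_tidy_command(source_path), capture_output=True, text=True
    )
    # Tidy exits with 0 for success, 1 for warnings, 2 for errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout


# An unquoted attribute value is enough to make an XML parser give up.
print(is_well_formed("<p class=MsoNormal>text</p>"))
print(is_well_formed('<p class="MsoNormal">text</p>'))
```

Calling tidy_word_html("report.htm") should then give you markup you can
feed to the XSLT processor. Depending on the encoding of the Word file you
may also need to set Tidy's character encoding options.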
> -----Original Message-----
> From: Ted Stresen-Reuter [mailto:tedmasterweb@xxxxxxx]
> Sent: Tuesday, March 11, 2003 1:37 PM
> To: xsl-List@xxxxxxxxxxxxxxxxxxxxxx
>
> Hi,
>
> Thanks again to everyone who answers on this list. You've all been
> really sweet.
>
> Today's question hopes to try and tackle a transformation of the HTML
> produced by MS Word into a valid XHTML format.
>
> In general, the problem is Word doesn't produce "valid" XML
> (specifically, for many elements, attributes are not quoted). The file
> I'm working with starts with the following:
>
> <html xmlns:o="urn:schemas-microsoft-com:office:office"
> xmlns:w="urn:schemas-microsoft-com:office:word"
> xmlns="http://www.w3.org/TR/REC-html40">
>
> Additionally, a typical element might look like this:
>
> <p class=MsoNormal style='text-align:justify;mso-hyphenate:none'><![if
> !supportEmptyParas]> <![endif]><o:p></o:p></p>
>
> Is it even possible to use such a document as a source document and if
> so, how do I handle errors returned by the XSLT processor when
> unquoted attributes are found?
>
> Thanks again to all of you who take the time to read and actually
> answer these queries.
>
> Sincerely,
>
> Ted Stresen-Reuter
>
>
> XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
>