
  • From: "Waters, Michael, Springer US" <Mike.Waters@s...>
  • To: "XML Developers List" <xml-dev@l...>
  • Date: Sat, 12 Mar 2011 18:02:09 -0500

I found some related material in the list archives, but I wanted to check my understanding of the use of C1 characters in XML 1.0 and in HTML 4.

 

We have a UTF-8 encoded XML document that has gone through a number of conversions and import/export routines into and out of a CMS. At all times, the XML document was valid against the DTD, and in Oxygen everything seems fine. No errors were reported in the workflow until a late stage, when, in rendering to HTML, Saxon reported:

 

   net.sf.saxon.trans.DynamicError: Illegal HTML character: decimal 146

 

I traced the error to an article title, where there was an embedded hex character reference:

 

   Language rights versus speakers&#x0092; rights

 

Unicode character U+0092 is given as a C1 control character named PRIVATE USE TWO (PU2); despite the name, it is not in one of Unicode’s Private Use Areas. I can’t see our vendor or any workflow step adding that character, intentionally or not. About the only explanation that makes sense to me is that at some point (probably in the source document), Windows-1252 encoding was used, where byte 146 (0x92) is, I think, a right single quotation mark (U+2019). (Whether that’s the appropriate character in this case is another matter.)
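
If the Windows-1252 theory is right, the stray character could be repaired by reinterpreting each C1 code point as the Windows-1252 byte of the same value. The following is only a minimal Python sketch of that idea, not something from our workflow; the function name `repair_c1` is made up for illustration, and it assumes every C1 character in the text really did start life as a Windows-1252 byte.

```python
def repair_c1(text: str) -> str:
    """Reinterpret C1 controls (U+0080..U+009F) as Windows-1252 bytes.

    Hypothetical helper: assumes the C1 characters are mis-decoded
    Windows-1252 bytes rather than intentional control characters.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                # Treat the code point as the single byte it probably came from.
                out.append(bytes([cp]).decode("cp1252"))
            except UnicodeDecodeError:
                # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
                # undefined, so those characters are passed through unchanged.
                out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


title = "Language rights versus speakers\u0092 rights"
fixed = repair_c1(title)  # U+0092 becomes U+2019 RIGHT SINGLE QUOTATION MARK
```

Whether U+2019 is then the character the title should carry is still an editorial decision; the sketch only undoes the presumed encoding mix-up.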

 

So, in all the XML processes, character U+0092 was passed through as legal, but in outputting to HTML it is illegal? I’m missing something here, surely.
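
To make the apparent mismatch concrete, here is a rough Python sketch comparing the two character repertoires as I read them: the `Char` production of XML 1.0, and the character ranges declared in the SGML declaration of HTML 4.01, which marks code points 127 through 159 as unused. The function names are mine, and this is not how Saxon performs its check; it only illustrates why the same code point can be accepted by one and rejected by the other.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp: int) -> bool:
    """True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_html4_char(cp: int) -> bool:
    """True if cp is in the ranges of the HTML 4.01 SGML declaration
    (my reading of it; 127..159 and the surrogates are marked UNUSED)."""
    return (
        cp in (9, 10, 13)
        or 32 <= cp <= 126
        or 160 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0x10FFFF
    )


# An XML 1.0 parser accepts the character reference without complaint.
doc = ET.fromstring("<title>Language rights versus speakers&#x0092; rights</title>")
has_c1 = "\u0092" in doc.text

xml_ok = is_xml10_char(0x92)    # legal in XML 1.0
html_ok = is_html4_char(0x92)   # not in the HTML 4.01 document character set
```

Under these assumptions U+0092 is a legal XML 1.0 character but falls in the gap that HTML 4 leaves unused, which would be consistent with the error appearing only at the HTML serialization step.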

 

Curiously, in my readings, HTML 5 seems to be special-casing Windows-1252 encoding, along with UTF-8, in that it must be supported:

 

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0

 

Best regards,

Mike Waters

 


