Subject: Re: Character 150 with Windows-1252 output
From: "andrew welch" <andrew.j.welch@xxxxxxxxx>
Date: Fri, 21 Apr 2006 13:56:13 +0100
On 4/21/06, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> > Why is it that #150 gets escaped when using Windows-1252
> > output encoding when it should contain that character?
>
> Because there is no character in the Windows-1252 character set that
> corresponds to the Unicode character with codepoint 150.
Yes, thanks. That makes sense now. The thing I'm struggling with now is
this:
This source XML:
<?xml version="1.0" encoding="Windows-1252" ?>
<foo>&#150;–</foo>
With this stylesheet:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output encoding="US-ASCII"/>
<xsl:template match="/">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
Gives this result:
<foo>&#150;&#8211;</foo>
I've checked the input file with a hex editor to make sure the
un-escaped dash really is 0x96. Somehow the two characters are
treated differently, which is something I didn't expect.
I think that 0x96 in the input XML, read using Windows-1252, should
become #8211 when output using any encoding other than Windows-1252.
That is what happens for the actual character 0x96, but the
character reference #150 gets serialised back as #150...
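The same difference can be reproduced outside XSLT. Below is a minimal sketch, assuming Python 3 with its standard cp1252 codec and xml.etree.ElementTree standing in for the parser and serialiser (neither is the processor used above). The element content and the helper name encodable_in_cp1252 are assumptions made for illustration. It shows the raw byte 0x96 being decoded through the Windows-1252 table, while the character reference is taken as a Unicode code point regardless of the declared encoding:

```python
import xml.etree.ElementTree as ET

# Assumed test document: the character reference &#150; followed by the raw
# byte 0x96, declared as Windows-1252 (reconstructed from the description).
doc = b'<?xml version="1.0" encoding="Windows-1252" ?>\n<foo>&#150;\x96</foo>'

# The raw byte goes through the Windows-1252 table: 0x96 -> U+2013 (en dash).
raw_char = b"\x96".decode("cp1252")

# A character reference names a Unicode code point directly, so &#150; is
# U+0096 (a C1 control character), whatever the document encoding says.
ref_char = chr(150)

# Parsing yields two different characters in the text node.
text = ET.fromstring(doc).text

# Serialising to US-ASCII escapes each one by its own code point.
serialized = text.encode("ascii", "xmlcharrefreplace")


def encodable_in_cp1252(ch):
    """Hypothetical helper: True if ch has a byte in Python's cp1252 codec."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


print(ord(raw_char), ord(ref_char))
print(serialized)
print(encodable_in_cp1252(ref_char), encodable_in_cp1252(raw_char))
```

In this sketch the two characters never meet: U+0096 has no slot in the Windows-1252 table, which would also account for it being escaped when the output encoding is Windows-1252.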
Any thoughts?