Subject: RE: MSXML and Encoding
From: Ian Brockbank <ian@xxxxxxxxxxxxxx>
Date: Wed, 8 Sep 1999 17:20:37 +0100
Hi Steven,
> Very strange.
> Characters such as é or è do not get parsed.
>
> Eg. This fails, giving a ?
>
> <?xml version='1.0' encoding='UTF-8'?>
> <root>
> è
> </root>
>
> giving the reason
> An Invalid character was found in text content. Line 3, Position 1
> ?</root>
Let's look at this in hex:
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 27 31 <?xml version='1
2e 30 27 20 65 6e 63 6f 64 69 6e 67 3d 27 55 54 .0' encoding='UT
46 2d 38 27 3f 3e 0d 0a 3c 72 6f 6f 74 3e 0d 0a F-8'?>..<root>..
e8 0d 0a 3c 2f 72 6f 6f 74 3e 0d 0a ...</root>..
Remember the table:
> UTF-8 mapping                 UCS-2 char
> -------------                 ----------
> 0nnnnnnn                      0x0000-0x007f
> 110nnnnn 10nnnnnn             0x0080-0x07ff
> 1110nnnn 10nnnnnn 10nnnnnn    0x0800-0xffff
All is fine for the first 3 lines of hex - everything is less than
0x80, so it corresponds to itself.
Then we hit the è. This is 0xe8 or 11101000. This is interpreted
as the start of a 3-byte character of the form 1000nnnn nnnnnnnn,
where the next two bytes must be of the form 10nnnnnn. What's
the next byte in the file? 0d (carriage return), or 00001101.
That doesn't start with 10, so something's gone wrong with the UTF-8
encoding. So the processor gives an error.
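To make that check concrete, here is a rough Python sketch of the same
reasoning. It is only an illustration, not what MSXML does internally; the
helper names expected_length and is_continuation are made up for this
example, and only the 1- to 3-byte forms from the table are handled.

```python
def expected_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte.

    Hypothetical helper: covers only the 1- to 3-byte forms and
    returns 0 for anything else.
    """
    if lead & 0x80 == 0x00:   # 0nnnnnnn
        return 1
    if lead & 0xE0 == 0xC0:   # 110nnnnn
        return 2
    if lead & 0xF0 == 0xE0:   # 1110nnnn
        return 3
    return 0


def is_continuation(byte: int) -> bool:
    """True if the byte has the 10nnnnnn continuation form."""
    return byte & 0xC0 == 0x80


# The tail of the file from the hex dump: a lone e8, then CR LF.
data = b"<root>\r\n\xe8\r\n</root>\r\n"
pos = data.index(0xE8)

needed = expected_length(data[pos])        # e8 claims a 3-byte sequence
followers = data[pos + 1 : pos + needed]   # the bytes that should continue it
all_ok = all(is_continuation(b) for b in followers)

# Python's own decoder rejects the same input.
try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```

Here needed comes out as 3, followers is the CR LF pair, and neither of
those bytes has the 10 prefix, so both the hand check and the built-in
decoder reject the input.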
If you want è in your document, you have to encode it into UTF-8 as
2-byte with nnnnn nnnnnn corresponding to 00011 101000 (e8), ie
11000011 10101000 - c3 a8, which looks like Ã¨ if viewed as Latin-1.
Alternatively you could use the entity &egrave; (assuming this is
defined in your DTD - see previous discussions).
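The bit-packing can be checked with a few lines of Python. This is just a
sketch under the assumption that you have Python to hand; encode_two_byte
is a made-up helper for illustration, and the built-in encoder is used
only to confirm the result.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack a code point into the 110nnnnn 10nnnnnn form.

    Hypothetical helper: the 2-byte form carries 11 payload bits,
    so it covers 0x80 through 0x7FF only.
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not a 2-byte UTF-8 code point")
    lead = 0xC0 | (code_point >> 6)      # 110 + top 5 bits
    trail = 0x80 | (code_point & 0x3F)   # 10 + low 6 bits
    return bytes([lead, trail])


by_hand = encode_two_byte(0xE8)          # bytes c3 a8
builtin = "\u00e8".encode("utf-8")       # the same bytes from the built-in codec
as_latin1 = by_hand.decode("latin-1")    # how the pair looks in a Latin-1 viewer
```

If the file is written out with these two bytes in place of the single e8,
the UTF-8 declaration in the XML header matches the content.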
Any clearer?
Cheers,
Ian
--
Ian Brockbank, Indigo Active Vision Systems, The Edinburgh Technopole,
Bush Loan, Edinburgh EH26 0PJ Tel: 0131-475-7234 Fax: 0131-475-7201
work: ian@xxxxxxxxxxxxxx personal: Ian.Brockbank@xxxxxxxxxxx
web: ScottishDance@xxxxxxxxxxx http://www.scottishdance.net/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list