Re: Use of UTF-8 and UTF-16

To: Philippe Poulard <Philippe.Poulard@s...>
Subject: Re: Use of UTF-8 and UTF-16
From: Chris Gray <cpgray@l...>
Date: Wed, 2 Nov 2005 12:43:28 -0500 (EST)
Cc: Elliotte Harold <elharo@m...>, Rick Jelliffe <rjelliffe@a...>, Xml-Dev <xml-dev@l...>
In-reply-to: <4368C783.3070008@s...>
References: <NBBBIBMKFOFCNEBAKDPLCEKIKDAA.xml-dev@b...><39325.203.51.20.11.1130759065.squirrel@i...><43660951.3090707@m...> <4368C783.3070008@s...>

On Wed, 2 Nov 2005, Philippe Poulard wrote:

> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files
> >> will usually be smaller.
> >
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
> >
>
> hi,
>
> I read somewhere that :
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantic)
>
> XML uses Unicode

10646 reserves the codes U+D800..U+DFFF for use in pairs (the UTF-16
surrogates) to address the characters U-00010000..U-0010FFFF.  Each
pair carries 20 bits, which are added to 0x10000, so the resulting codes
are 17 to 21 bits long.  If the reserved values (U+D800..U+DFFF) are
themselves encoded in UTF-8 they take 3 bytes each, so it takes 6 bytes
to address a character above U+FFFF that way.  That form is not
well-formed UTF-8, though (it is what CESU-8 does).  UTF-8 proper
encodes the values U-00010000..U-0010FFFF directly as 4 bytes.
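
To make the 4-byte versus 6-byte difference concrete, here is a rough
Python sketch.  The function names are my own, chosen for illustration,
and the "surrogatepass" error handler is used only to make Python emit
the 3-byte form for each surrogate, which a strict UTF-8 encoder would
refuse to do.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20-bit payload
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low


def direct_utf8(cp):
    """Encode the code point directly, as well-formed UTF-8 does."""
    return chr(cp).encode("utf-8")


def surrogate_wise_utf8(cp):
    """Encode each surrogate separately (not well-formed UTF-8)."""
    high, low = to_surrogates(cp)
    # "surrogatepass" allows lone surrogates through the UTF-8 codec.
    return (chr(high) + chr(low)).encode("utf-8", "surrogatepass")


if __name__ == "__main__":
    for cp in (0x10000, 0x10FFFF):
        print(f"U-{cp:08X}",
              len(direct_utf8(cp)), "bytes direct,",
              len(surrogate_wise_utf8(cp)), "bytes via surrogates")
```

For every code point in the range the direct form is 4 bytes long and
the surrogate-wise form is 6 bytes long.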

<http://czyborra.com/utf/> explains some of the details.

Chris Gray
University of Waterloo Library
