On Wed, 2 Nov 2005, Philippe Poulard wrote:

> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files will usually be smaller.
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
> >
>
> hi,
>
> I read somewhere that :
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantic)
>
> XML uses Unicode

10646 reserves the codes U+D800..U+DFFF for use in pairs to address
characters with codes up to 20-bits long (U-00010000..U-0010FFFF).
These reserved values (U+D800..U+DFFF) get encoded at 3 bytes each in
UTF-8 so it takes 6 bytes to address the values 17 to 20 bits long via
the 10646 scheme. However, UTF-8 can encode the UNICODE values
U-00010000..U-0010FFFF as 4 bytes.

<http://czyborra.com/utf/> explains some of the details.

Chris Gray
University of Waterloo Library
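The 4-byte versus 6-byte difference described above can be checked with a small Python sketch. It is only an illustration: the helper names (surrogate_pair, utf8_direct, utf8_via_surrogates) are made up for this example, and Python's "surrogatepass" error handler is used here solely to force each surrogate to be encoded on its own, which a conforming UTF-8 encoder would refuse to do.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = cp - 0x10000              # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits of the offset
    return high, low


def utf8_direct(cp):
    """Encode the code point itself as UTF-8 (4 bytes for this range)."""
    return chr(cp).encode("utf-8")


def utf8_via_surrogates(cp):
    """Encode each surrogate separately (3 bytes each, 6 bytes in total).

    "surrogatepass" is needed because strict UTF-8 rejects lone surrogates.
    """
    high, low = surrogate_pair(cp)
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in (high, low)
    )


if __name__ == "__main__":
    for cp in (0x10000, 0x10FFFF):
        print(
            f"U+{cp:06X}: direct={len(utf8_direct(cp))} bytes, "
            f"via surrogates={len(utf8_via_surrogates(cp))} bytes"
        )
```

Run as a script, this prints the byte counts for the first and last code points of the range, so the two encodings can be compared directly.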