At 1:36 PM +0000 11/21/03, Alaric B Snell wrote:

>People who use pound signs and accented characters, like us 
>Europeans, would see each such symbol taking 3 bytes, but they 
>currently take 2 bytes in UTF-8 and occur only occasionally 
>interspersed with US-ASCII characters anyway, so the hit would be 
>nowhere near as bad as the hit UTF-8 incurs for the Chinese and 
>their neighbours.
>

One should keep in mind that Chinese and similar languages are quite 
compressed to start with, far more so than English text is. For 
example, in UTF-8 the English word "tree" takes four bytes. The 
Japanese word for tree takes three bytes. The English word "grove" 
takes five bytes. The Japanese word for grove takes three bytes. The 
English word "forest" takes six bytes. The Japanese word for forest 
still takes only three bytes. I don't know the Japanese word for 
antidisestablishmentarianism, but whatever it is, it's probably a lot 
smaller than the English one. Comparing alphabetic languages to 
ideographic ones is really apples to oranges. Word for word, Chinese 
documents tend to be smaller, even in UTF-8.
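
The byte counts above are easy to check. Here is a minimal Python sketch, assuming the Japanese words meant are the single kanji 木 (tree), 林 (grove) and 森 (forest); the message does not spell them out, so that choice and the helper name utf8_len are illustrative assumptions only.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Assumed kanji for the three words; not taken from the original message.
WORDS = {"tree": "木", "grove": "林", "forest": "森"}

for english, japanese in WORDS.items():
    print(f"{english}: {utf8_len(english)} bytes, {japanese}: {utf8_len(japanese)} bytes")

# The pound sign from the quoted message is a two-byte character in UTF-8.
print(f"£: {utf8_len('£')} bytes")
```

Each kanji sits in the range U+0800 to U+FFFF, which UTF-8 encodes in three bytes, while each ASCII letter takes one byte, so the English words grow with their length and the single-kanji words do not.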
-- 

   Elliotte Rusty Harold
   elharo@m...
   Effective XML (Addison-Wesley, 2003)
   http://www.cafeconleche.org/books/effectivexml
   http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
