On Aug 13, 2005, at 14:19, Alan Gutierrez wrote:

>     Am I seeing that with Unicode in Java, you need to work with
>     String and not with individual char? That puts a dent in my
>     algorithm, which advanced along the characters in the string.

It depends on what exactly you are doing. A Java char is not a Unicode 
character but a UTF-16 code unit. The values \u0000 and \uFFFF should 
never occur in XML and can be used as sentinels if your algorithm works 
on UTF-16 code units. For the purpose of indexing text, working on 
UTF-16 code units as opposed to working on Unicode characters may well 
be good enough. In that case, a surrogate pair can be treated as two 
adjacent "characters". (Note that even when operating on UTF-32, you 
can have tightly-coupled characters when there is a base character 
followed by combining marks, so working on Unicode characters does not 
buy you inter-character independence.)
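
As a rough illustration of the difference, here is a minimal Java sketch, 
not anything from your code. The class name, method names and sample 
string are made up for the example. It walks a string once by UTF-16 
code unit using \uFFFF as an end sentinel, and once by Unicode code 
point using String.codePointAt and Character.charCount (available since 
Java 5). It assumes the input has already been checked as legal XML 
text, so the sentinel cannot occur in it.

```java
// Hypothetical example class; names are illustrative only.
final class CodeUnitWalk {
    // U+FFFF is not a legal XML character, so it can mark the end of the buffer.
    static final char SENTINEL = '\uFFFF';

    /** Walks UTF-16 code units up to the sentinel; a surrogate pair counts as two. */
    public static int countCodeUnits(String s) {
        char[] buf = (s + SENTINEL).toCharArray();
        int i = 0;
        while (buf[i] != SENTINEL) {
            i++; // an indexing algorithm would process buf[i] here
        }
        return i;
    }

    /** Walks Unicode code points; a surrogate pair counts as one. */
    public static int countCodePoints(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // combines a valid surrogate pair
            i += Character.charCount(cp);   // advances by 1 or 2 code units
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', U+1D538 as a surrogate pair, 'e', combining acute accent U+0301
        String text = "a\uD835\uDD38e\u0301";
        System.out.println(countCodeUnits(text));  // prints 5
        System.out.println(countCodePoints(text)); // prints 4
    }
}
```

Note that the last two code points ('e' plus the combining acute) still 
form one user-perceived character, so even the code point walk does not 
give you independent units.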

-- 
Henri Sivonen
hsivonen@i...
http://hsivonen.iki.fi/

