Re: Parsing Kanji (Japanese) characters...

To: <xml-dev@l...>
Subject: Re: Parsing Kanji (Japanese) characters...
From: "Rick Jelliffe" <ricko@a...>
Date: Thu, 19 Jun 2003 17:33:16 +1000
References: <H0000cbb064190df@MHS>

From: <nizar.hirani@c...>
  
> Is the SAX Parser able to handle Kanji characters? Any help/pointers are
> appreciated.

The problem is probably that your document is stored in an encoding that uses
escape sequences. When the parser reads it using a different encoding (e.g. the
default, UTF-8), it correctly flags the ESC character as an error, because ESC
is not a legal character in XML.
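To illustrate, here is a minimal Python sketch of that failure. The sample document and the helper name parses_as_default_encoding are made up for the example; it assumes the file carries no encoding declaration, so the parser falls back to UTF-8.

```python
import xml.sax

# Hypothetical sample document with no encoding declaration,
# saved as ISO-2022-JP.
raw = '<?xml version="1.0"?><name>漢字</name>'.encode("iso2022_jp")

ESC = b"\x1b"
# The Kanji run is bracketed by escape sequences, so ESC bytes appear in the file.
has_escape = ESC in raw


def parses_as_default_encoding(data: bytes) -> bool:
    """Return True if a SAX parser accepts the bytes with no encoding hint."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


# The parser treats the bytes as UTF-8, meets ESC, and refuses the document.
rejected = not parses_as_default_encoding(raw)
```

The parser is not failing on the Kanji as such; it is failing on the control character that introduces them.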

There are three main Japanese encodings in common use: ISO 2022, Shift JIS and
EUC. Each of these has variants and extensions, and documents can also be in
Unicode encodings, which have variants of their own. It is a very good thing that XML
systems can often detect that your data has been mislabelled, isn't it! Otherwise
you could load wrongly decoded data into a database and corrupt it without noticing.
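As a rough illustration of how different these encodings are on the wire, the Python sketch below encodes the same two Kanji four ways and checks which byte strings would be refused by a strict UTF-8 decoder. The sample text and the helper name decodes_as_utf8 are only for the example.

```python
sample = "漢字"
encodings = ["iso2022_jp", "shift_jis", "euc_jp", "utf-8"]

# Same text, four different byte sequences.
encoded = {name: sample.encode(name) for name in encodings}


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# True where reading the bytes as UTF-8 fails at the byte level.
mislabel_caught = {name: not decodes_as_utf8(data) for name, data in encoded.items()}
```

For this sample, the Shift JIS and EUC bytes are not valid UTF-8 at all, so the mislabelling shows up immediately. The ISO-2022-JP bytes are all 7-bit, so they pass a byte-level UTF-8 check, and it is the XML parser's rejection of ESC that catches the mistake.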

Your text is probably encoded in ISO-2022-JP (JIS).
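If that guess is right, one way around the problem is to transcode the bytes to UTF-8 before handing them to the parser. This is only a sketch under assumptions: the sample document, the TextCollector class and the parse_text function are hypothetical, and the document is assumed to have no encoding declaration (if it has one, it would need to agree with the new encoding).

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect all character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(raw: bytes, source_encoding: str = "iso2022_jp") -> str:
    """Transcode raw bytes to UTF-8, parse them, and return the text content."""
    utf8_bytes = raw.decode(source_encoding).encode("utf-8")
    handler = TextCollector()
    xml.sax.parseString(utf8_bytes, handler)
    return "".join(handler.chunks)


# Hypothetical sample: a document saved as ISO-2022-JP without a declaration.
sample = '<?xml version="1.0"?><name>漢字</name>'.encode("iso2022_jp")
recovered = parse_text(sample)
```

The other route is to label the document correctly in its XML declaration, but that only helps if your particular parser supports the named encoding, which is worth checking before relying on it.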

If you work with Far Eastern data much, I recommend you read Ken
Lunde's "CJKV Information Processing" (Chinese, Japanese, Korean and
Vietnamese computing) from O'Reilly. It is an amazing book.

On the WWW, see http://lfw.org/text/jp.html#iso2022

Cheers
Rick Jelliffe

References:
- Parsing Kanji (Japanese) characters...
  - From: nizar.hirani@c...
