Re: There is a serious amount of character encoding conversions occurring inside our computers and on the Web

From: Michael Sokolov <sokolov@i...>
To: "Costello, Roger L." <costello@m...>
Date: Fri, 28 Dec 2012 11:34:52 -0500

On 12/28/2012 9:01 AM, Costello, Roger L. wrote:
> How did it find a match?
>
> The underlying byte sequence for the iso-8859-1 López is: 4C F3 70 65 7A (one byte -- F3 -- is used to encode ó).
>
> The underlying byte sequence for the UTF-8 López is: 4C C3 B3 70 65 7A (two bytes -- C3 B3 -- are used to encode ó).
>
> The search application cannot be doing a byte-for-byte match, else it would find no match.
>
> The codepoint for the UTF-8 ó character is F3.
>
> Hey, iso-8859-1 uses F3 to encode ó.
>
> So perhaps the search application is converting the UTF-8 bytes to codepoints and then comparing those codepoints to the iso-8859-1 bytes. That would result in a match.
>
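The byte sequences quoted above can be checked with a short Python sketch. This is only an illustration of the two kinds of comparison being discussed; the variable names are made up for the example, and nothing here is meant to describe what the search application actually does.

```python
# Illustrative sketch only: variable names are hypothetical.
name = "L\u00f3pez"  # "López", with ó written as U+00F3

latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

# Hex dumps of the two encodings (bytes.hex with a separator needs Python 3.8+).
latin1_hex = latin1_bytes.hex(" ").upper()
utf8_hex = utf8_bytes.hex(" ").upper()
print(latin1_hex)
print(utf8_hex)

# A byte-for-byte comparison fails because ó is one byte in one
# encoding and two bytes in the other.
byte_match = latin1_bytes == utf8_bytes

# Decoding each byte sequence with its own encoding first, then
# comparing the resulting strings (sequences of codepoints), succeeds.
decoded_match = latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")

print(byte_match, decoded_match)
```
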
One point of comparison: Lucene used to use Java characters internally
(which are UTF-16 code units), and now uses UTF-8 internally (not
codepoints).  I think it's unlikely that your search application is
using iso-8859-1 internally, although it might be using codepoints, as
you suggest.  Of course it's no accident that each iso-8859-1 byte value
equals the Unicode codepoint of the same character (they are the first
256 codepoints); that was one sensible thing done by the character
encoding gurus.
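A quick way to see the iso-8859-1/codepoint correspondence, sketched in Python (the names are hypothetical, chosen just for this example): decode every possible byte value as iso-8859-1 and compare each resulting character's codepoint with the byte it came from.

```python
# Illustrative sketch only: names are hypothetical.
all_bytes = bytes(range(256))            # every byte value 0x00..0xFF
decoded = all_bytes.decode("iso-8859-1") # one character per byte

# Each decoded character's codepoint should equal the original byte value.
codepoints_match = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The specific case from this thread: ó is byte F3 and codepoint U+00F3.
o_acute_codepoint = ord("\u00f3")
o_acute_latin1 = "\u00f3".encode("iso-8859-1")[0]

print(codepoints_match, hex(o_acute_codepoint), hex(o_acute_latin1))
```

This only holds for characters in the iso-8859-1 repertoire; anything above U+00FF has no iso-8859-1 byte at all.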

-Mike

References:
- There is a serious amount of character encoding conversions occurring inside our computers and on the Web
  - From: "Costello, Roger L." <costello@m...>
