Re: Re: UTF's considered best practice [was: Re: next

From: rjelliffe <rjelliffe@a...>
To: Uche Ogbuji <uche@o...>
Date: Fri, 10 Dec 2010 15:37:27 +1100

 On Thu, 9 Dec 2010 08:28:22 -0700, Uche Ogbuji <uche@o...> wrote:
> On Wed, Dec 8, 2010 at 11:31 PM, Jim DeLaHunt  wrote:
>
>  Sometimes UTF-16 is a more compact representation, sometimes UTF-8
> is. It depends on the frequency distribution of characters in the
> document. But they have equivalent descriptive power; either can
> represent any sequence of Unicode characters. If nextml adopts
> UTF-16, be aware that it can be serialised to bytes in either
> little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml
> should account for those possibilities. It should also allow for the
> special Byte-Order Mark character (BOM), which is used to distinguish
> the two.
>
> Thanks for all the great links and references. That backs up my
> suspicion that supporting a diversity of encodings is a matter of less
> urgency than it was when XML 1.0 was born.
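
 The compactness point quoted above is easy to check. The following is
 only a rough sketch in Python (not something used in this thread); the
 sample strings and the helper name encoded_sizes are made up for
 illustration, and UTF-16 is counted without a BOM.

```python
# Compare how many bytes the same text needs in UTF-8 and in UTF-16.
# The sample strings are illustrative assumptions, not data from the thread.
samples = {
    "ASCII-heavy markup": "<para>hello</para>",
    "Japanese text": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes); UTF-16 is counted without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

 ASCII-heavy markup comes out smaller in UTF-8, while text made mostly of
 CJK characters comes out smaller in UTF-16, so which one wins really
 does depend on the document.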

 I would be careful about taking what Unicode conference speakers say as
 necessarily being authoritative rather than aspirational! But I probably
 do agree with them. :-)

 It would be great if the XML interlude has swept all the old encodings
 away and ushered in a Unicode-only world, but betting on that is a gamble
 (a gamble worth taking, I think). Does it cost that much? The issue
 remains that the Windows* and Java APIs have default encodings based on
 locale and language decisions*.
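
 To make the default-encoding problem concrete, here is a small sketch in
 Python (used purely for illustration; it is not the Windows or Java API
 mentioned above). The sample string and the choice of cp1252 as the
 "legacy" encoding are assumptions.

```python
import locale

# Illustrative sample; any text containing a non-ASCII character will do.
text = "café"

# The encoding the platform would choose when none is named.
# This varies with operating system and locale settings.
print("platform default:", locale.getpreferredencoding(False))

# The same string produces different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")
print("UTF-8:  ", utf8_bytes)
print("cp1252: ", cp1252_bytes)

# Reading UTF-8 bytes with the wrong (legacy) encoding yields mojibake.
misread = utf8_bytes.decode("cp1252")
print("misread:", misread)
```

 Naming the encoding explicitly at every read and write is the usual way
 to keep a locale-dependent default from silently changing the bytes.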

 In 1997, a good argument against only allowing UTF-* was that people
 needed an on-ramp to Unicode. So supporting other encodings was a way of
 neutralizing the problem: paradoxically, supporting other encodings was a
 way of promoting Unicode. (With Perl being a big win here: I remember
 reading that it finally moved to Unicode because XML pushed things past
 the tipping point.)

 In 2010, perhaps that argument is not needed: the pro-active thing might
 be to provide off-ramps from the legacy encodings. New formats only in
 UTF-8 might be the better idea.

> As for the BOM, yes, that should be key in any XML successor, as it is
> in XML 1.0 itself. In XML 1.0, you can tell the encoding even if
> it's not in the XML declaration because if not, it must either be
> UTF-8 (if there is no BOM), or UTF-8, UTF-16LE, UTF-16BE, etc.
> depending on BOM.
>
> If it's OK to say UTF only (and we banish the standalone declaration),
> then there is no need for an explicit encoding declaration beside
> optional BOM.

 Magic numbers are still useful.
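
 As an illustration of a BOM acting as a magic number, here is a minimal
 sketch in Python of sniffing a UTF-only format. It assumes that only
 UTF-8 and UTF-16 are allowed and that a missing BOM means UTF-8; the
 function names sniff_encoding and decode_document are hypothetical, and
 this is not the full XML 1.0 detection procedure.

```python
import codecs

# BOM byte sequences and the encodings they signal.
# Only UTF-8 and UTF-16 are covered; UTF-32 is deliberately left out
# (its little-endian BOM begins with the same two bytes as UTF-16LE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes):
    """Return (encoding, bom_length); assume UTF-8 when there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_document(data: bytes) -> str:
    """Strip any BOM and decode the rest with the sniffed encoding."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)


doc = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
print(sniff_encoding(doc), decode_document(doc))
```

 With the encoding set restricted like this, two or three leading bytes
 are enough to pick a decoder before any declaration is parsed.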

 Cheers
 Rick Jelliffe

 * http://stackoverflow.com/questions/927652/why-encoding-default-getbytes-returns-different-results-in-vb-net-and-c

