On Thu, 9 Dec 2010 08:28:22 -0700, Uche Ogbuji <uche@o...> wrote:
> On Wed, Dec 8, 2010 at 11:31 PM, Jim DeLaHunt wrote:
> > Sometimes UTF-16 is a more compact representation, sometimes UTF-8
> > is. It depends on the frequency distribution of characters in the
> > document. But they have equivalent descriptive power; either can
> > represent any sequence of Unicode characters.  If nextml adopts
> > UTF-16, be aware that it can be serialised to bytes in either
> > little-endian or big-endian order (UTF-16LE or UTF-16BE), so nextml
> > should account for those possibilities. It should also allow for the
> > special Byte-Order Mark character (BOM), which is used to distinguish
> > the two.
>
> Thanks for all the great links and references. That backs up my
> suspicion that supporting a diversity of encodings is a matter of
> less urgency than it was when XML 1.0 was born.

I would be careful about taking what Unicode conference speakers say
as necessarily being authoritative rather than aspirational! But I
probably do agree with them. :-)

It would be great if the XML interlude has swept all the old encodings
away and ushered in a Unicode-only world, but it is a gamble (a gamble
worth taking, I think). Does it cost that much? The issue that
Windows* & Java APIs have default encodings based on locale and
language decisions* still remains.

In 1997, a good argument against only allowing UTF-* was that people
needed an on-ramp to Unicode. So supporting other encodings was a way
of neutralizing the problem: paradoxically, supporting other encodings
was a way of promoting Unicode. (With Perl being a big win here: I
remember reading that it finally moved to Unicode because XML pushed
things past the tipping point.)

In 2010, perhaps that argument is not needed: the pro-active thing
might be to provide off-ramps from the legacy encodings. New formats
only in UTF-8 might be the better idea.

> As for the BOM, yes, that should be key in any XML successor, as it
> is in XML 1.0 itself. In XML 1.0, you can tell the encoding even if
> it's not in the XML declaration because if not, it must either be
> UTF-8 (if there is no BOM), or UTF-8, UTF-16LE, UTF-16BE, etc.
> depending on BOM.
>
> If it's OK to say UTF only (and we banish the standalone
> declaration), then there is no need for an explicit encoding
> declaration beside optional BOM.

Magic numbers are still useful.

Cheers
Rick Jelliffe

* http://stackoverflow.com/questions/927652/why-encoding-default-getbytes-returns-different-results-in-vb-net-and-c
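
For anyone who wants to see what "UTF only, optional BOM" could look like in practice, here is a rough Python sketch. It assumes a hypothetical format that allows only UTF-8 and UTF-16 and falls back to UTF-8 when no BOM is present. The names sniff_utf, decode_utf and encoded_sizes are made up for illustration and do not come from any XML library. UTF-32 is deliberately left out; its little-endian BOM begins with the same two bytes as the UTF-16LE BOM, so a fuller sniffer would have to test for it first.

```python
import codecs

# BOM byte sequences from the standard library, paired with codec names.
# Assumption: only UTF-8 and UTF-16 are permitted by the format.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_utf(data: bytes):
    """Return (encoding, bom_length) based only on a leading BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: assume UTF-8, as the thread suggests for a UTF-only format.
    return "utf-8", 0


def decode_utf(data: bytes) -> str:
    """Decode a byte stream, skipping the BOM if there is one."""
    encoding, skip = sniff_utf(data)
    return data[skip:].decode(encoding)


def encoded_sizes(text: str):
    """Return (utf8_bytes, utf16_bytes) for text, without any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))
```

The last helper illustrates Jim's point about compactness: encoded_sizes("<doc>hello</doc>") gives (16, 32), while encoded_sizes("日本語") gives (9, 6), so which encoding is smaller really does depend on the characters in the document.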