On Thu, Dec 09 2010 05:56:24 +0000, liam@w... wrote:
> On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
...
>> One more mini-addition: would it be possible to have parsers ignore the
>> BOM at the start of a UTF-8 file? Some editors seem to insist on
>> creating them, they are allowed by the UTF-8 spec, and probably ought to
>> be considered external to the actual file content. Also, maybe if we're

The definition of the BOM/ZWNBS, the role of the BOM with UTF-8, and the
prominence of UTF-8 in the Unicode Standard have changed over time with
successive versions of the Unicode Standard [2]. The discussion of
detecting character encoding has also changed over time in successive
editions of XML 1.0. You could review UTF-8 and BOM on the basis that
much has changed since the first XML 1.0 spec.

>> going to allow multiple root elements we could also allow whitespace in
>> the prolog? People often put it there, and it seems like something
>> that could be tolerated easily enough.
>
> I have always felt it was a bug in the XML spec that the XML declaration
> becomes a regular processing instruction if there's a blank line in
> front of it.

It makes it usable as a file signature for the OS. (If "<?xml" seems a
bit much, try EPUB, where you have to read the first 50+ bytes of a Zip
archive file [1].)

...
>> On restriction to UTF-8 (16 if we insist, but really do folks store
>> *files* as UTF-16?)
>
> Yes. Frequently.
>
>> : is this really a problem for non-western
>> languages?
>
> If you manufacture memory and hard drives, then utf-8 is truly
> delightful in countries where most characters will be 3 or more
> bytes/octets in length in utf-8.

Liam's roundabout way of saying YMMV.

> It's also a common misconception that Unicode is a 16-bit character set;
> it defines more than 65536 characters, and "surrogate pairs" in
> languages like Java make utf16 as complex as utf8; processing characters

Easier, probably, since you don't have surrogate pairs in UTF-8.

> in either utf-8 or ucs-32 are the most common choices outside the Java
> world as far as I can tell.

Regards,

Tony Graham
Tony.Graham@M...
Director, W3C XSL FO SG Invited Expert
Menteith Consulting Ltd, XML Guild member
XML, XSL and XSLT consulting, programming and training
Registered Office: 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
Registered in Ireland - No. 428599
http://www.menteithconsulting.com
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
xmlroff XSL Formatter    http://xmlroff.org
xslide Emacs mode        http://www.menteith.com/wiki/xslide
Unicode: A Primer        urn:isbn:0-7645-4625-2

[1] Section 4 in http://www.idpf.org/ocf/ocf1.0/download/ocf10.htm
[3] http://inasmuch.as/2007/10/03/bom-in-utf-8-good-bad-or-ugly/
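
For anyone who wants to poke at the points above, here is a rough Python sketch, not anything taken from the XML spec or from an existing parser. It shows a signature check that tolerates a leading UTF-8 BOM but still rejects whitespace before "<?xml", and a byte-count comparison across encodings. The function names (strip_utf8_bom, looks_like_xml, encoded_sizes) are made up for illustration, and the signature check is an assumption about how a tool might sniff files, not how any particular OS does it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def looks_like_xml(data: bytes) -> bool:
    """Hypothetical file-signature check.

    Assumption: a BOM is treated as external to the content, but any
    whitespace before "<?xml" still defeats the signature.
    """
    return strip_utf8_bom(data).startswith(b"<?xml")


def encoded_sizes(text: str) -> dict:
    """Byte counts for text in UTF-8, UTF-16 and UTF-32, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    print(looks_like_xml(codecs.BOM_UTF8 + b"<?xml version='1.0'?><a/>"))
    print(looks_like_xml(b"\n<?xml version='1.0'?><a/>"))
    # ASCII, CJK, and one character outside the Basic Multilingual Plane
    for sample in ("abc", "\u65e5\u672c\u8a9e", "\U0001D11E"):
        print(repr(sample), encoded_sizes(sample))
```

The CJK sample takes 3 bytes per character in UTF-8 against 2 in UTF-16, which is the storage point Liam is making. The last sample lies outside the BMP, so UTF-16 needs a surrogate pair (4 bytes) for a single character.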