E.1 Introduction

This appendix is an informative, not a normative, part of the Level 3 DOM specification. Characters are represented in Unicode by numbers called code points (also called scalar values). These numbers can range from 0 up to 1,114,111 = 10FFFF₁₆ (although some of these values are illegal). Each code point can be directly encoded with a 32-bit code unit. This encoding is termed UCS-4 (or UTF-32). The DOM specification, however, uses UTF-16, in which the most frequent characters (which have values no greater than FFFF₁₆) are represented by a single 16-bit code unit, while characters above FFFF₁₆ use a special pair of code units called a surrogate pair. For more information, see [Unicode] or the Unicode Web site.
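As an illustration of the encoding described above, the following is a minimal sketch of how one code point maps to one or two UTF-16 code units. The function name to_utf16_code_units is hypothetical and is not part of the DOM or of any language binding; the arithmetic is the standard UTF-16 surrogate computation.

```python
from typing import List


def to_utf16_code_units(code_point: int) -> List[int]:
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units.

    Hypothetical helper for illustration only; not a DOM API.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Values reserved for surrogates are not valid characters on their own.
        raise ValueError("surrogate values cannot be encoded as characters")
    if code_point <= 0xFFFF:
        # Frequent characters fit in a single 16-bit code unit.
        return [code_point]
    # Characters above FFFF are split into a surrogate pair.
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return [high, low]
```

For example, the letter "A" (41₁₆) yields the single code unit 0041₁₆, while the musical G clef symbol (1D11E₁₆) yields the surrogate pair D834₁₆ DD1E₁₆. This is why a string's length in code units can exceed its length in code points.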
While indexing by code points as opposed to code units is not
common in programs, some specifications such as [XPath10] (and therefore XSLT and [XPointer]) use code point indices. For
interfacing with such formats it is recommended that the
programming language provide string processing methods for
converting code point indices to code unit indices and back. Some
languages do not provide these functions natively; for these it is
recommended that the native NOTE: |