

On Jan 21, 2004, at 11:57 AM, jcowan@r... wrote:

>> The 'codePoint' typedef may be problematic:
>>
>>     // Unicode code points (4-byte int on most systems)
>>     typedef wchar_t codePoint;
>>
>> ...
> I have argued privately that wchar_t is in fact the Right Thing here
> despite its variability in size (UTF-32 on Unix platforms, UTF-16 on
> Windows), because it makes genx compatible with both standardized and
> non-standardized facilities, most especially "..."L strings.  Some
> conditional logic will be needed to interpret the input as UTF-16 or
> UTF-32, which can be based on sizeof(wchar_t).  Hypothetical platforms
> where sizeof(wchar_t) == 1 can be neglected.

Almost.  How about we leave it as wchar_t, but *not* UTF-16, so a value
that's in a surrogate block is an error.  Then we change the name from
codePoint (which could be interpreted as meaning "UTF-16 Code Point") to
something more explicit like

numericValueCorrespondingToAUnicodeCharacterAsInUPlusFourHexDigitsIsThatClear

John Cowan has suggested that "codeUnit" might be a good name; I'd be
inclined to "uniChar".  Any other ideas?
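
To make the surrogate rule above concrete, here is a rough sketch of the
check, not actual genx code.  The typedef name uniChar is just the
candidate floated above, and uniCharIsValid is a made-up helper name.
It assumes only that wchar_t is at least 16 bits wide.

```c
#include <wchar.h>

/* Hypothetical name; the thread has not settled on one. */
typedef wchar_t uniChar;

/*
 * Returns 1 if c is acceptable as a Unicode character number,
 * 0 if it falls in the surrogate block or lies outside the Unicode range.
 * The value is widened to unsigned long first so the comparisons behave
 * the same whether wchar_t is 16 or 32 bits, signed or unsigned.
 */
static int uniCharIsValid(uniChar c)
{
  unsigned long v = (unsigned long) c;

  /* U+D800..U+DFFF are surrogates: an error under this proposal */
  if (v >= 0xD800UL && v <= 0xDFFFUL)
    return 0;

  /* nothing above U+10FFFF is a Unicode character */
  if (v > 0x10FFFFUL)
    return 0;

  return 1;
}
```

Under this sketch a caller on a 16-bit wchar_t platform can pass in
anything from the BMP except surrogates, and whatever accepts characters
would simply report an error for a value where uniCharIsValid returns 0.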

If someone wants to put a generic UTF-16 processor on top of genx, that
would be fine.  I don't see the demand for supporting it at the input
end of genx because the UTF-16-centric languages like Java and C# have
decent XML-writing software already. -Tim


  • Follow-Ups:
  • References:
    • Genx
      • From: Tim Bray <tbray@t...>
    • Re: Genx
      • From: Joe English <jenglish@f...>
    • Re: Genx
      • From: jcowan@r...