9 Internationalized Resource Identifiers (IRIs)

Work is currently in progress to produce an RFC defining Internationalized Resource Identifiers (IRIs). Since this work is not yet complete, this section gives a syntactic definition of IRIs for the purposes of this specification. The XML Core Working Group expects to issue an erratum replacing this section with a reference to the RFC when it is published.

Users defining namespaces are advised to restrict namespace names to URIs until the RFC is published and software supporting IRIs is in common use. Implementors are likewise advised not to reject namespace names solely because they contain characters that the drafts do not allow.

For a more general definition and discussion of IRIs see [IRIdraft5] (work in progress).

URI references are restricted to a subset of the ASCII characters; IRI references also allow most Unicode characters from #xA0 onwards. Earlier drafts of the IRI RFC (e.g. [IRIdraft3]) also allowed some of the ASCII characters that are disallowed in URI references, but the current draft ([IRIdraft5]) does not.

The additional characters allowed in IRIs by [IRIdraft5] are:

  • the Unicode plane 0 characters #xA0-#xD7FF, #xF900-#xFDCF, #xFDF0-#xFFEF

  • the Unicode plane 1-14 characters #x10000-#x1FFFD ... #xD0000-#xDFFFD, #xE1000-#xEFFFD
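As an illustration only, the ranges above can be written as a membership test. The following Python sketch is not part of [IRIdraft5]; the function name is_additional_char is hypothetical, and the code assumes that the ranges elided with "..." follow the same #xN0000-#xNFFFD pattern for each plane between 1 and 13.

```python
# Plane 0 ranges, copied from the first bullet above.
PLANE0_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]


def is_additional_char(ch: str) -> bool:
    """Return True if ch falls in one of the additional-character ranges.

    Hypothetical helper for illustration; ch must be a single character.
    """
    cp = ord(ch)
    if cp <= 0xFFFF:
        return any(lo <= cp <= hi for lo, hi in PLANE0_RANGES)
    plane, offset = cp >> 16, cp & 0xFFFF
    if 1 <= plane <= 13:
        # Assumption: every elided plane uses the range #xN0000-#xNFFFD.
        return offset <= 0xFFFD
    if plane == 14:
        # Plane 14 starts at #xE1000 rather than #xE0000.
        return 0x1000 <= offset <= 0xFFFD
    return False  # planes 15 and 16 are not listed
```

Under these assumptions, is_additional_char("\u00e9") is true, while ASCII letters, surrogate code points, and the private use area at #xE000 are all rejected.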

An IRI reference is a string that can be converted to a URI reference by applying the following steps:

  1. Convert the hostname part, if present, using the ToASCII operation specified in Section 4.1 of [IDNA] with the flags UseSTD3ASCIIRules and AllowUnassigned set to TRUE.

  2. Escape all additional characters as follows:

    1. Each additional character is converted to UTF-8 [UTF8] as one or more bytes.

    2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).

    3. The original character is replaced by the resulting character sequence.
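Step 2 of the conversion can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the elided plane ranges are assumed to follow the #xN0000-#xNFFFD pattern, and step 1 (the ToASCII conversion of the hostname) is deliberately left out, so the input is assumed to have had its hostname converted already.

```python
def is_additional_char(ch: str) -> bool:
    """Compact test for the additional-character ranges (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return (0xA0 <= cp <= 0xD7FF
                or 0xF900 <= cp <= 0xFDCF
                or 0xFDF0 <= cp <= 0xFFEF)
    plane, offset = cp >> 16, cp & 0xFFFF
    # Assumption: planes elided with "..." use the range #xN0000-#xNFFFD.
    return ((1 <= plane <= 13 and offset <= 0xFFFD)
            or (plane == 14 and 0x1000 <= offset <= 0xFFFD))


def escape_additional_chars(iri_ref: str) -> str:
    """Apply step 2 only: percent-escape every additional character."""
    out = []
    for ch in iri_ref:
        if is_additional_char(ch):
            # 2.1: convert the character to UTF-8 bytes.
            utf8_bytes = ch.encode("utf-8")
            # 2.2: escape each byte as %HH.
            # 2.3: the escaped sequence replaces the original character.
            out.append("".join(f"%{b:02X}" for b in utf8_bytes))
        else:
            out.append(ch)
    return "".join(out)


print(escape_additional_chars("http://example.org/r\u00e9sum\u00e9"))
# prints http://example.org/r%C3%A9sum%C3%A9
```

Characters outside the additional ranges, including all ASCII characters, pass through unchanged, so a string that is already a URI reference is returned as-is.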

NOTE: The algorithm in [IRIdraft5] includes a UCS normalization step, but this makes no difference to which strings are IRI references.