Appendices

4 Phases of Serialization

Following the sequence normalization process described in [serdm], serialization can be regarded as involving three phases of processing.

For an implementation-defined output method, any of these phases MAY be skipped or MAY be performed in a different order than is specified here. For the output methods defined in this specification, these phases are carried out sequentially as follows:

Markup generation produces the character representation of those parts of the serialized result that describe the structure of the normalized sequence. This phase is influenced by the parameters method, doctype-system, doctype-public, include-content-type, indent, omit-xml-declaration, standalone, undeclare-namespaces and version. In the cases of the XML, HTML and XHTML output methods, this phase produces the character representations of the following:
- the document type declaration;
- start tags and end tags (except for attribute values, whose representation is produced by the character expansion phase);
- processing instructions; and
- comments.
In the cases of the XML and XHTML output methods, this phase also produces the following:
- the XML or text declaration; and
- empty element tags (except for the attribute values).
In the case of the text output method, this phase has no effect.
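The division of labour between markup generation and character expansion can be sketched in Python. This is an illustration only and not part of this specification: the function names `start_tag` and `end_tag` and the `expand_value` callback are hypothetical, and namespace declarations, indentation and the document type declaration are omitted.

```python
def start_tag(name, attributes, expand_value, empty=False):
    """Build a start tag, or an empty element tag when empty=True.

    Attribute values are delegated to expand_value, which stands in for
    the character expansion phase (hypothetical callback).
    """
    parts = [name]
    for attr_name, value in attributes.items():
        parts.append('%s="%s"' % (attr_name, expand_value(value)))
    return "<" + " ".join(parts) + ("/>" if empty else ">")


def end_tag(name):
    """Build the end tag for an element."""
    return "</" + name + ">"
```

In this sketch the tag delimiters, element names and attribute names come from markup generation, while everything between the quotation marks is whatever the supplied character expansion function returns.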
Character expansion is concerned with the representation of characters appearing in text and attribute nodes in the normalized sequence. The substitution processes that apply are listed below, in priority order: a character that is handled by one process in this list will be unaffected by processes appearing later in the list, except that a character affected by Unicode normalization MAY be affected by creation of CDATA sections and by character escaping:
- URI escaping (in the case of URI-valued attributes in the HTML and XHTML output methods), as determined by the escape-uri-attributes parameter
- Character mapping, as determined by the use-character-maps parameter. Text nodes that are children of elements specified by the cdata-section-elements parameter are not affected by this step.
- Unicode normalization, if requested by the normalization-form parameter. Unicode normalization is applied to the character stream that results after all markup generation and character expansion has taken place.
  
  For the definitions of the various normalization forms, see [CHARMOD].

  The meanings associated with the possible values of the normalization-form parameter are as follows:
  - NFC specifies the serialized result will be in Unicode Normalization Form C.
  - NFD specifies the serialized result will be in Unicode Normalization Form D.
  - NFKC specifies the serialized result will be in Unicode Normalization Form KC.
  - NFKD specifies the serialized result will be in Unicode Normalization Form KD.
  - fully-normalized specifies the serialized result will be in fully normalized form.
  - none specifies that no Unicode normalization will be applied.
  - An implementation-defined value has an implementation-defined effect.
  NOTE:
  Any characters produced under the effect of the use-character-maps parameter are not subject to Unicode normalization. If the normalization-form parameter has a value other than none and the use-character-maps parameter is not empty, the whole of the serialized document might not be in the normalization form specified by the normalization-form parameter.
- Creation of CDATA sections, as determined by the cdata-section-elements parameter. Note that this is also affected by the encoding parameter, in that characters not present in the selected encoding cannot be represented in a CDATA section.
- Escaping according to XML or HTML rules of special characters and of characters that cannot be represented in the selected encoding. For example, replacing < with &lt;.
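The priority order of the substitution processes can be illustrated with a minimal Python sketch. It is not part of this specification: the names `expand_text` and `escape_char`, the `char_map` argument and the simplified escaping rules are assumptions made for illustration, and URI escaping and CDATA section creation are left out. Only the four normalization forms offered by Python's standard `unicodedata` module are handled; any other value leaves the text unnormalized in this sketch.

```python
import unicodedata


def escape_char(ch, encoding):
    """Escape one character using simplified XML rules (assumption)."""
    special = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
    if ch in special:
        return special[ch]
    try:
        ch.encode(encoding)
        return ch
    except UnicodeEncodeError:
        # Not representable in the selected encoding: character reference.
        return "&#x%X;" % ord(ch)


def expand_text(text, char_map=None, normalization_form="none", encoding="utf-8"):
    """Hypothetical sketch of character expansion for one text node."""
    char_map = char_map or {}
    out = []
    run = []  # consecutive characters not handled by the character map

    def flush():
        # Normalize, then escape, the characters collected so far.
        # Normalizing each run separately is a simplification.
        if not run:
            return
        chunk = "".join(run)
        run.clear()
        if normalization_form in ("NFC", "NFD", "NFKC", "NFKD"):
            chunk = unicodedata.normalize(normalization_form, chunk)
        out.extend(escape_char(ch, encoding) for ch in chunk)

    for ch in text:
        if ch in char_map:
            flush()
            out.append(char_map[ch])  # mapped output bypasses later steps
        else:
            run.append(ch)
    flush()
    return "".join(out)
```

For instance, `expand_text("cafe\u0301", normalization_form="NFC", encoding="ascii")` first composes the final two characters into U+00E9 and then, because that character is not available in the selected encoding, emits `caf&#xE9;`. A character supplied through `char_map` is copied to the output unchanged, which is why the result is not guaranteed to be in the requested normalization form.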
Encoding, as controlled by the encoding parameter, converts the character stream produced by the previous phases into a byte stream.

NOTE:
Serialization is only defined in terms of encoding the result as a stream of bytes. However, a serializer MAY provide an option that allows the encoding phase to be skipped, so that the result of serialization is a stream of Unicode characters. The effect of any such option is implementation-defined, and a serializer is not REQUIRED to support such an option.
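As a rough illustration of the encoding phase and of the optional behavior described in the note, the following Python sketch converts a character stream to bytes. The function name `encode_phase` and the `skip_encoding` flag are hypothetical; the specification does not define how such an option is exposed.

```python
def encode_phase(chars, encoding="utf-8", skip_encoding=False):
    """Hypothetical final phase: turn the character stream into bytes."""
    if skip_encoding:
        # Implementation-defined option: return Unicode characters as-is.
        return chars
    # Strict encoding: characters outside the encoding are expected to have
    # been escaped already during character expansion.
    return chars.encode(encoding)
```

The same character stream yields different byte streams depending on the encoding parameter: `encode_phase("caf\u00e9")` gives `b"caf\xc3\xa9"` in UTF-8, while `encode_phase("caf\u00e9", "iso-8859-1")` gives `b"caf\xe9"`.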
