Stylus Studio XML Editor

Table of contents

Appendices

5.8 Unicode

Unicode BIDI Processing

The characters in certain scripts are written horizontally from right to left. In some documents, in particular those written with the Arabic or Hebrew script, and in some mixed-language contexts, text in a single (visually displayed) block may appear with mixed directionality. This phenomenon is called bidirectionality, or "BIDI" for short.

The Unicode standard [UNICODE] defines a complex algorithm, the Unicode BIDI algorithm [UNICODE-TR9] , for determining the proper directionality of text. The algorithm is based on both an implicit part based on character properties, as well as explicit controls for embeddings and overrides.

The final step of refinement uses this algorithm and the Unicode bidirectional character type of each character to convert the implicit directionality of the text into explicit markup in terms of formatting objects. For example, a sub-sequence of Arabic characters in an otherwise English paragraph would cause the creation of an inline formatting object with the Arabic characters as its content, with a "direction" property of "rtl" and a "unicode-bidi" property of "bidi-override". The formatting object makes explict the previously implicit right to left positioning of the Arabic characters.

As defined in [UNICODE-TR9] , the Unicode BIDI algorithm takes a stream of text as input, and proceeds in three main phases:

  1. Separation of the input text into paragraphs. The rest of the algorithm affects only the text between paragraph separators.

  2. Resolution of the embedding levels of the text. In this phase, the bidirectional character types, plus the Unicode directional formatting codes, are used to produce resolved embedding levels. The normative bidirectional character type for each character is specified in the Unicode Character Database [UNICODE-CD] .

  3. Reordering the text for display on a line-by-line basis using the resolved embedding levels, once the text has been broken into lines.

The algorithm, as described above, requires some adaptions to fit into the XSL processing model. First, the final, text reordering step is not done during refinement. Instead, the XSL equivalent of re-ordering is done during formatting. The inline-progression-direction of each glyph is used to control the stacking of glyphs as described in [area-stackcon] . The inline-progression-direction is determined at the block level by the "writing-mode" property and within the inline formatting objects within a block by the "direction" and "unicode-bidi" properties that were either specified on inline formatting objects generated by tree construction or are are on inline formatting objects introduced by this step of refinement (details below).

Second, the algorithm is applied to a sequence of characters coming from the content of one or more formatting objects. The sequence of characters is created by processing a fragment of the formatting object tree. A fragment is any contiguous sequence of children of some formatting object in the tree. The sequence is created by doing a pre-order traversal of the fragment down to the fo:character level. During the pre-order traversal, every fo:character formatting object adds a character to the sequence. Furthermore, whenever the pre-order scan encounters a node with a "unicode-bidi" property with a value of "embed" or "override", add a Unicode RLO/LRO or RLE/LRE character to the sequence as appropriate to the value of the "direction" and "unicode-bidi" properties. On returning to that node after traversing its content, add a Unicode PDF character. In this way, the formatting object tree fragment is flattened into a sequence of characters. This sequence of characters is called the flattened sequence of characters below.

Third, in XSL the algorithm is applied to delimited text ranges instead of just paragraphs. A delimited text range is a maximal flattened sequence of characters that does not contain any delimiters. Any formatting object that generates block-areas is a delimiter. It acts as a delimiter for its content. It also acts as a delimiter for its parent's content. That is, if the parent has character content, then its children formatting objects that generate block-areas act to break that character content into anonymous blocks each of which is a delimited text range. In a similar manner, the fo:multi-case formatting object acts as delimiter for its content and the content of its parent. Finally, text with an orientation that is not perpendicular to the dominant-baseline acts as a delimiter to text with an orientation perpendicular to the dominant-baseline. We say that text has an orientation perpendicular to the dominant-baseline if the glyphs that correspond to the characters in the text are all oriented perpendicular to the dominant-baseline.

NOTE: 

In most cases, a delimited text range is the maximal sequence of characters that would be formatted into a sequence of one or more line-areas. For the fo:multi-case and the text with an orientation perpendicular to the dominant-baseline, the delimited range may be a sub-sequence of a line or sequence of lines. For example, in Japanese formatted in a vertical writing-mode, rotated Latin and Arabic text would be delimited by the vertical Japanese characters that immediately surround the Latin and Arabic text. Any formatting objects that generated inline-areas would have no affect on the determination of the delimited text range.

For each delimited text range, the inline-progression-direction of the nearest ancestor (including self) formatting object that generates a block-area determines the paragraph embedding level used in the Unicode BIDI algorithm. This is the default embedding level for the delimited text range.

Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum embedding level is level 61. Having more than 61 embedding levels is an error. An XSL processor may signal the error. If it does not signal the error, it must recover by allowing a higher maximum number of embedding levels.

The second step of the Unicode BIDI algorithm labels each character in the delimited text range with a resolved embedding level. The resolved embedding level of each character will be greater than or equal to the paragraph embedding level. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level.

Once the resolved embedding levels are determined for the delimited text range, new fo:bidi-override formatting objects with appropriate values for the "direction" and "unicode-bidi" properties are inserted into the formatting object tree fragment that was flattened into the delimited text range such that the following constraints are satisfied:

  1. For any character in the delimited text range, the inline-progression-direction of the character must match its resolved embedding level.

  2. For each resolved embedding level L from the paragraph embedding level to the maximum resolved embedding level, and for each maximal contiguous sequence of characters S for which the resolved embedding level of each character is greater than or equal to L,

    1. There is an inline formatting object F which has as its content the formatting object tree fragment that flattens to S and has a "direction" property consistent with the resolved embedding level L.

      NOTE: 

      F need not be an inserted formatting object if the constraint is met by an existing formatting object or by specifying values for the "direction" and "unicode-bidi" properties on an existing formatting object.

    2. All formatting objects that contain any part of the sequence S are properly nested in F and retain the nesting relationships they had in the formatting object tree prior to the insertion of the new formatting objects.

      NOTE: 

      Satisfying this constraint may require splitting one or more existing formatting objects in the formatting object tree each into a pair of formatting objects each of which has the same set of computed property values as the original, unsplit formatting object. One of the pair would be ended before the start of F or start after the end of F and the other would start after the start of F or would end before the end of F, respectively. The created pairs must continue to nest properly to satisfy this constraint. For example, assume Left-to-right text is represented by the character "L" and Right-to-left text is represented by "R". In the sub-tree

       <fo:block>
         LL
         <fo:inline ID="A">LLLRRR</fo:inline>
         RR
       </fo:block>
      
      assuming a paragraph embedding level of "0", the resolved embedding levels would require the following (inserted and replicated) structure:
       <fo:block>
         LL
         <fo:inline ID="A">LLL</fo:inline>
         <fo:bidi-override direction="rtl">
           <fo:inline ID="A+">RRR</fo:inline>
           RR
         </fo:bidi-override>
       </fo:block>
      
      Note that the fo:inline with ID equal "A" has been split into two fo:inlines with the first one having the original ID of "A" and the second having an ID of "A+". Since ID's must be unique, the computed value of any ID or Key property must not be replicated in the second member of the pair. The value of "A+" was just used for illustrative purposes.

  3. No fewer fo:bidi-override formatting objects can be inserted and still satisfy the above constraints. That is, add to the refined formatting object tree only as many fo:bidi-override formatting objects, beyond the formatting objects created during tree construction, as are needed to represent the embedding levels present in the document.