5.8 Unicode BIDI Processing

The characters in certain scripts are written horizontally from right to left. In some documents, in particular those written with the Arabic or Hebrew script, and in some mixed-language contexts, text in a single (visually displayed) block may appear with mixed directionality. This phenomenon is called bidirectionality, or "BIDI" for short.

The Unicode standard [UNICODE] defines a complex algorithm, the Unicode BIDI algorithm [UNICODE-TR9], for determining the proper directionality of text. The algorithm has an implicit part, based on character properties, and an explicit part, consisting of controls for embeddings and overrides. The final step of refinement uses this algorithm and the Unicode bidirectional character type of each character to convert the implicit directionality of the text into explicit markup in terms of formatting objects. For example, a sub-sequence of Arabic characters in an otherwise English paragraph would cause the creation of an inline formatting object with the Arabic characters as its content, with a "direction" property of "rtl" and a "unicode-bidi" property of "bidi-override". The formatting object makes explicit the previously implicit right-to-left positioning of the Arabic characters.

As defined in [UNICODE-TR9], the Unicode BIDI algorithm takes a stream of text as input, and proceeds in three main phases:
The algorithm, as described above, requires some adaptations to fit into the XSL processing model.

First, the final text reordering step is not done during refinement. Instead, the XSL equivalent of reordering is done during formatting. The inline-progression-direction of each glyph is used to control the stacking of glyphs as described in [area-stackcon]. The inline-progression-direction is determined at the block level by the "writing-mode" property, and within the inline formatting objects within a block by the "direction" and "unicode-bidi" properties. These properties are either specified on inline formatting objects generated by tree construction or are on inline formatting objects introduced by this step of refinement (details below).

Second, the algorithm is applied to a sequence of characters coming from the content of one or more formatting objects. The sequence of characters is created by processing a fragment of the formatting object tree. A fragment is any contiguous sequence of children of some formatting object in the tree. The sequence is created by doing a pre-order traversal of the fragment down to the fo:character level. During the pre-order traversal, every fo:character formatting object adds a character to the sequence. Furthermore, whenever the pre-order scan encounters a node with a "unicode-bidi" property with a value of "embed" or "bidi-override", it adds a Unicode RLO/LRO or RLE/LRE character to the sequence, as appropriate to the values of the "direction" and "unicode-bidi" properties. On returning to that node after traversing its content, it adds a Unicode PDF character. In this way, the formatting object tree fragment is flattened into a sequence of characters. This sequence is called the flattened sequence of characters below.

Third, in XSL the algorithm is applied to delimited text ranges instead of just paragraphs. A delimited text range is a maximal flattened sequence of characters that does not contain any delimiters.
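The flattening step described above can be sketched as follows. This is a minimal illustration, not an implementation of the XSL processing model: the `FoNode` class, its field names, and the `flatten` function are hypothetical, and the sketch does not split the result at delimiters. It treats "bidi-override" as the overriding value of "unicode-bidi". The control characters are the standard Unicode code points for LRE, RLE, PDF, LRO, and RLO.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Unicode explicit directional formatting characters.
LRE, RLE, PDF, LRO, RLO = "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"


@dataclass
class FoNode:
    """Hypothetical stand-in for a formatting object in the tree."""
    name: str                      # e.g. "fo:inline", "fo:character"
    char: str = ""                 # only meaningful for fo:character
    direction: str = "ltr"         # "ltr" or "rtl"
    unicode_bidi: str = "normal"   # "normal", "embed", or "bidi-override"
    children: List["FoNode"] = field(default_factory=list)


def opening_control(node: FoNode) -> Optional[str]:
    """Pick the control character implied by "direction" and "unicode-bidi"."""
    rtl = node.direction == "rtl"
    if node.unicode_bidi == "embed":
        return RLE if rtl else LRE
    if node.unicode_bidi == "bidi-override":
        return RLO if rtl else LRO
    return None


def flatten(fragment: List[FoNode]) -> str:
    """Pre-order traversal of a fragment down to the fo:character level."""
    out: List[str] = []

    def visit(node: FoNode) -> None:
        control = opening_control(node)
        if control is not None:
            out.append(control)        # entering an embedding or override
        if node.name == "fo:character":
            out.append(node.char)      # each fo:character adds one character
        for child in node.children:
            visit(child)
        if control is not None:
            out.append(PDF)            # leaving the node closes it with PDF

    for node in fragment:
        visit(node)
    return "".join(out)
```

For instance, a fragment holding the characters "ab", then an fo:inline with direction "rtl" and unicode-bidi "embed" around two Hebrew letters, then "c", flattens to "ab", RLE, the two letters, PDF, "c".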
Any formatting object that generates block-areas is a delimiter. It acts as a delimiter for its content. It also acts as a delimiter for its parent's content. That is, if the parent has character content, then its child formatting objects that generate block-areas break that character content into anonymous blocks, each of which is a delimited text range. In a similar manner, the fo:multi-case formatting object acts as a delimiter for its content and the content of its parent. Finally, text with an orientation that is not perpendicular to the dominant-baseline acts as a delimiter to text with an orientation perpendicular to the dominant-baseline. We say that text has an orientation perpendicular to the dominant-baseline if the glyphs that correspond to the characters in the text are all oriented perpendicular to the dominant-baseline.

NOTE: For each delimited text range, the inline-progression-direction of the nearest ancestor (including self) formatting object that generates a block-area determines the paragraph embedding level used in the Unicode BIDI algorithm. This is the default embedding level for the delimited text range.

Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum embedding level is 61. Having more than 61 embedding levels is an error. An XSL processor may signal the error. If it does not signal the error, it must recover by allowing a higher maximum number of embedding levels.

The second step of the Unicode BIDI algorithm labels each character in the delimited text range with a resolved embedding level. The resolved embedding level of each character will be greater than or equal to the paragraph embedding level. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level.
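The relationship between embedding levels and direction can be sketched as below. This is only an illustration under stated assumptions: the function names and the `right_to_left` flag are hypothetical, and the resolved levels are taken as given input rather than computed. The mapping of a left-to-right paragraph to level 0 and a right-to-left paragraph to level 1 follows the convention of [UNICODE-TR9]; the limit of 61 is the one stated above.

```python
from itertools import groupby
from typing import List, Tuple

MAX_EMBEDDING_LEVEL = 61  # maximum embedding level stated in the text


def paragraph_embedding_level(right_to_left: bool) -> int:
    """Default level of a delimited text range: 0 for LTR, 1 for RTL."""
    return 1 if right_to_left else 0


def direction_of_level(level: int) -> str:
    """Odd levels are right-to-left, even levels are left-to-right."""
    if level < 0:
        raise ValueError("embedding levels start at zero")
    return "rtl" if level % 2 else "ltr"


def direction_runs(levels: List[int]) -> List[Tuple[int, int, int, str]]:
    """Group consecutive characters that share a resolved level.

    Returns (start, end, level, direction) tuples, with end exclusive.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level, direction_of_level(level)))
        start += length
    return runs
```

As a usage example, the resolved levels `[0, 0, 1, 1, 2, 2, 0]` (left-to-right paragraph, a right-to-left run, a number inside that run, then left-to-right text again) give four runs with directions "ltr", "rtl", "ltr", "ltr". Each boundary between runs is a place where the level changes.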
In addition, numeric text will always end up with a higher level than the paragraph level.

Once the resolved embedding levels are determined for the delimited text range, new fo:bidi-override formatting objects with appropriate values for the "direction" and "unicode-bidi" properties are inserted into the formatting object tree fragment that was flattened into the delimited text range such that the following constraints are satisfied: