Hi Ian,
We also found word processor named styles to be used too inconsistently to be useful for list nesting. Only in very few workflows is it possible to demand properly marked-up lists without any indentation/numbering overrides. It is not uncommon for authors to prefer, for example, the (a), (b), (c) listing that a template offers for level 2 lists over the 1., 2., 3. listing that the template offers for level 1 lists. Since understanding and configuring word processors' list settings is not that easy, authors sometimes manually change the list item indentations in order to convey the desired visual appearance:

(a) This is a "List 2" item
(b) This is another "List 2" item
    1. This is actually a "List 1" item
    2. This is another "List 1" item
(c) This is another "List 2" item

Most of the time, we therefore rely on a "visual" XSLT 2.0 list nester [1] that uses the equivalent of CSS margin-left and text-indent properties (from the intermediate DocBook/CSS-based normalization format that Wendell mentioned [2]). The first level has, for example, a margin-left of 18pt and a text-indent of -18pt. The next level has a margin-left of 36pt and a text-indent of -18pt. We then group adjacent list items that have the same amount of margin-left + text-indent, with some tolerance (1.5pt) allowed. We also take into account leading tabs and their widths. They are sometimes used in lieu of proper left margins in (short) list continuation paragraphs or in other more deeply nested items.

The list marker content, in particular the numbering, will either be calculated according to the complex OOXML (or the less complex IDML) rules, or we will take into account the literal values that the author/typesetter chose to use. Sometimes there is even a mix of calculated and literal list numbers within the same list. Then we try to determine a coherent list type (lower alpha, arabic, bullets, ...) for a given nesting section. If no list type can be determined, we will turn it into a definition list.
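The grouping by margin-left + text-indent described above can be sketched roughly as follows. This is only a Python illustration of the idea, not the XSLT 2.0 implementation linked in [1]; the names `Para` and `nest_by_indent` and the output structure are hypothetical, and leading tabs, continuation paragraphs and list-type detection are left out. Only the 1.5pt tolerance and the 18pt/36pt example values are taken from the message.

```python
from dataclasses import dataclass

TOLERANCE_PT = 1.5  # tolerance quoted in the message


@dataclass
class Para:
    """A paragraph with CSS-like layout properties (hypothetical model)."""
    text: str
    margin_left: float  # in pt
    text_indent: float  # in pt, typically negative for hanging list markers

    @property
    def marker_pos(self) -> float:
        # margin-left + text-indent is where the list marker visually starts
        return self.margin_left + self.text_indent


def nest_by_indent(paras, tol=TOLERANCE_PT):
    """Nest adjacent paragraphs by marker position.

    Returns a list of items; each item is a dict with "text" and "children",
    where "children" is again a list of items.
    """
    root = []
    stack = []  # open levels as (marker position, list receiving the items)
    for p in paras:
        pos = p.marker_pos
        # Close every open level that is clearly deeper than this paragraph.
        while stack and pos < stack[-1][0] - tol:
            stack.pop()
        if stack and abs(pos - stack[-1][0]) <= tol:
            # Same level (within tolerance): continue the current list.
            items = stack[-1][1]
        else:
            # First paragraph, or clearly deeper: open a new level below
            # the most recent item of the enclosing level.
            items = stack[-1][1][-1]["children"] if stack else root
            stack.append((pos, items))
        items.append({"text": p.text, "children": []})
    return root


if __name__ == "__main__":
    sample = [
        Para('(a) This is a "List 2" item', 18, -18),
        Para('(b) This is another "List 2" item', 18, -18),
        Para('1. This is actually a "List 1" item', 36, -18),
        Para('2. This is another "List 1" item', 37, -18),  # slightly off
        Para('(c) This is another "List 2" item', 18.5, -18),
    ]
    for item in nest_by_indent(sample):
        print(item["text"])
        for child in item["children"]:
            print("    " + child["text"])
```

With the sample data, the two numbered items end up nested below item (b) even though their indents differ by 1pt, and item (c) returns to the outer level.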
The whole multi-pass XSLT process is orchestrated by an XProc pipeline [3]. It may be customized by importing the XSLT and supplying the customized XSLT to the pipeline on the stylesheet port.

I recently estimated [4] that the heuristic visual nesting took approximately 300 hours to implement (with some iterations), the OOXML list number calculation took some 240 hours, and the IDML list number calculation took about 60 hours. So what Graydon said is true: You can hack a docx converter that does 80% of the work in a week, but then you need to rely on named styles, among other restrictions.

Gerrit

[1] https://github.com/transpect/evolve-hub/tree/master/lists-by-indent/xsl
[2] http://archive.xmlprague.cz/2013/presentations/Conveying_Layout_Information_with_CSSa/CSSa_xmlprague_gimsieke.html
[3] https://github.com/transpect/evolve-hub/blob/master/xpl/evolve-hub_lists-by-indent.xpl
[4] https://twitter.com/letexml/status/1045224789097492480

On 29.10.2018 22:04, ian.proudfoot@xxxxxxxxxxx wrote:

Agreed Wendell and Graydon.

I am already doing multiple passes to get the content in a suitable state to do the nesting part. I find that most word processed text is in a poor state for easy conversion to good XML that is valid to a specific schema. When based simply on paragraph and character style names the end result is often unusable. So I use temporary attributes that encode the important stylistic overrides - capturing what the author was trying to achieve. I have been very pleased with the results.
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer / Managing Directors: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt



