On 14/01/2022 02:44, Rick Jelliffe wrote:
> ... I am interested to know what gotchas people have found in real
> deployments, in the last 20 years, with XML with non-ASCII data and
> markup. And also, whether modern Unicode is actually good enough now

It should go without saying that you need a good formatter to produce a good result. Unicode can represent the characters, of course, but formatting also requires knowledge of things like line breaking, hyphenation, and when to form contextual glyphs.

The Unicode Bidirectional Algorithm [3] makes it easier to handle text in different directions consistently. That, of course, is based on properties assigned to each Unicode character, not just on its glyph or language. A lot of software uses the International Components for Unicode (ICU) [4] libraries, in either C/C++ or Java, to handle much of what Unicode defines.

The details of how to format a script can be hard to come by. Users of a particular script can give good hints when they find a problem with what you are currently producing. The W3C made a ground-breaking effort when it produced Requirements for Japanese Text Layout (JLReq) [5], which made the details of Japanese formatting accessible to the rest of us. The next JLReq version looks set to be a 'digital native' version, with some things simplified and some things promoted to being advanced options. (A bit like how MathML3 is becoming MathML Core [7], the subset that browsers consent to implement; David Carlisle can correct me if this is a misrepresentation based on my view from the outside.)

The JLReq concept has been copied/expanded into a bunch of task forces for different languages [8], plus there are other ways to crowd-source information. [10][11] (Back in 2012 [9], it looked to me like Community Groups would be the way to do this, but the W3C, like most of us, felt the gravitational pull of GitHub.)
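The per-character properties that the bidi algorithm relies on can be inspected even without ICU. A minimal sketch using Python's stdlib `unicodedata` module (my illustration, not something from ICU itself) shows the UAX #9 bidirectional class assigned to each character:

```python
# Sketch: reading the UAX #9 bidirectional class that Unicode assigns
# to each character, via Python's stdlib unicodedata (not ICU).
import unicodedata

for ch in "A1\u05D0\u0627":  # Latin letter, digit, Hebrew alef, Arabic alef
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
# U+0041 L
# U+0031 EN
# U+05D0 R
# U+0627 AL
```

The classes (strong left-to-right, European number, strong right-to-left, Arabic letter) are what the algorithm resolves into display order; they come from the character database, not from any glyph or declared language.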
> For example, are PUA characters used much in XML, or is Unihan plus
> markup good enough, or do people need to embed actual glyph
> information? How are new ideographs handled when you cannot wait for
> the Unicode Consortium process? Is the situation different with JSON?

I don't see non-standard characters in what I do. Modern font formats, such as OpenType, have largely or completely removed the need to think about glyph variants encoded in the Private Use Area of fonts. OpenType defines a lot of font features [1] that are encoded in the font file as lookup tables, and the formatter and the font can work together to turn text into ligatures or replacement glyphs.

Many people are familiar with there being a glyph for 'ffi', which is U+FB03, but a language that uses a different 'i', say 'ï', may also benefit from an 'ffï' ligature, which isn't in Unicode. An OpenType font could put an 'ffï' ligature in the Private Use Area, or it could be in the font as an unencoded glyph, but you wouldn't need to know which, because the ligature lookup would get you the right glyph if it exists.

CSS gave friendlier names to OpenType's four-letter feature tags, and you can use those names with CSS or XSL-FO [2], or you can access the font features directly. [12][13]

Regards,

Tony Graham.
--
Senior Architect
XML Division
Antenna House, Inc.
----
Skerries, Ireland
tgraham@a...
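The 'ffi' point above can be checked from Python's stdlib (a sketch of mine, for illustration): U+FB03 is a real Unicode code point with a compatibility decomposition back to plain 'f' + 'f' + 'i', which is why NFKC normalization and searching can fold it away, whereas an 'ffï' ligature has no code point at all and can only live inside the font:

```python
# Sketch: U+FB03 LATIN SMALL LIGATURE FFI exists in Unicode with a
# compatibility decomposition, so NFKC folds it back to plain 'ffi'.
# An 'ffï' ligature has no code point; it can only exist in a font,
# either in the Private Use Area or as an unencoded glyph.
import unicodedata

lig = "\uFB03"
print(unicodedata.name(lig))               # LATIN SMALL LIGATURE FFI
print(unicodedata.normalize("NFKC", lig))  # ffi
```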
[1] https://docs.microsoft.com/en-ie/typography/opentype/spec/features_ae
[2] https://www.antenna.co.jp/AHF/help/en/ahf-ext.html#axf.font-variant
[3] http://www.unicode.org/reports/tr9/
[4] https://icu.unicode.org/
[5] https://www.w3.org/TR/jlreq/
[6] https://github.com/w3c/jlreq/issues/281
[7] https://www.w3.org/TR/mathml-core/
[8] https://www.w3.org/International/i18n-drafts/nav/languagedev
[9] See page 24 in http://mentea.net/resources/multilingualweb2012.pdf
[10] https://w3c.github.io/type-samples/
[11] https://www.w3.org/International/i18n-activity/textlayout/
[12] https://www.antenna.co.jp/AHF/help/en/ahf-ext.html#axf.font-feature-settings
[13] https://caniuse.com/font-feature




