Hi Gerrit,
On Tue, Oct 11, 2016 at 3:29 PM, Imsieke, Gerrit, le-tex
gerrit.imsieke@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> But do we know that the characters are just bytes?
>
> Sometimes UTF-8 is being read as if it were ISO-8859-1 or CP-1252 (which
> is more likely on Windows) and then saved as UTF-8. Then â€™ are 3
> (multibyte) UTF-8 characters.
>
This is very similar to some of the advice that Liam shared with me; i.e.
something from a Windows server (I'm fairly sure that's the OS for the
application generating the $input.xml files) is reading UTF-8 and outputting
it as ISO-8859-1.
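For anyone following along, the misreading described above can be reproduced in a few lines. This is only a minimal Python sketch under my own assumptions, not what the application on the server actually does: the sample string and variable names are made up, and CP-1252 is assumed as the wrong encoding (strict ISO-8859-1 would turn bytes 0x80 and 0x99 into invisible control characters instead of visible ones).

```python
# Sketch: how three UTF-8 bytes become three separate characters when misread.
original = "1930\u2019s"                # U+2019 RIGHT SINGLE QUOTATION MARK
utf8_bytes = original.encode("utf-8")   # the quote becomes bytes E2 80 99

# Hypothetical faulty step: the bytes are decoded as CP-1252, not UTF-8,
# so each of the three bytes is treated as one character.
misread = utf8_bytes.decode("cp1252")

# Saving the misread text as UTF-8 again bakes the damage into the file.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())
print(misread)
print(len(original), len(misread), len(double_encoded))
```

The point of the sketch is only that one character in the source ends up as three characters after the round trip, which matches what the input file shows.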
> If this is the case, you can correct it with
>
> iconv -t WINDOWS-1252 -f UTF-8 input.xml \
>   | sed -e 's/ encoding="iso-8859-1"/ encoding="UTF-8"/' > output.xml
>
:) Now *this* is different. This replaces the ISO/CP-1252/... with U+FFFD,
which is arguably an improvement.
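The reverse direction, which is the idea behind the iconv command quoted above, can be sketched the same way. Again this is a hedged illustration in Python rather than an explanation of where the U+FFFD in my output came from: `repair` is a hypothetical helper name, CP-1252 is assumed as the encoding that was wrongly applied, and the second call only shows that a decoder substitutes U+FFFD when the recovered bytes are not valid UTF-8.

```python
# Sketch: undo the misreading by reversing it.
def repair(garbled: str) -> str:
    """Re-encode as CP-1252 to recover the raw bytes, then decode as UTF-8."""
    raw = garbled.encode("cp1252")                 # back to the original bytes
    return raw.decode("utf-8", errors="replace")   # invalid sequences -> U+FFFD

# The three characters seen in the damaged file, written as escapes.
garbled = "1930\u00e2\u20ac\u2122s"
print(repair(garbled))

# Text that was never double-encoded does not survive this treatment:
# the lone byte E9 is not valid UTF-8, so it is replaced with U+FFFD.
print(repr(repair("caf\u00e9")))
```

Note that `encode("cp1252")` raises `UnicodeEncodeError` if the text contains a character CP-1252 cannot represent, so a real clean-up would need to decide how to handle that case.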
> Gerrit
Bridger
>
>
> On 11.10.2016 21:23, Wolfgang Laun wolfgang.laun@xxxxxxxxx wrote:
>
>> The characters E2 80 99 are the UTF-8 encoding of the Unicode character
>> RIGHT SINGLE QUOTATION MARK.
>>
>> Simply changing the ISO-8859-1 in your XML file to UTF-8 should fix this.
>>
>>
>> On 11 October 2016 at 21:00, Bridger Dyson-Smith bdysonsmith@xxxxxxxxx
>> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> I'm struggling with a character encoding issue (or a character
>> representation issue maybe?): I have input XML that looks like this
>>
>> input.xml
>> <?xml version="1.0" encoding="iso-8859-1"?>
>> <documents>
>> <document>The reality of the effect of natural ventilation in a
>> residential attic cavity has been the topic of many debates and
>> scholarly reports since the 1930â€™s.</document>
>> </documents>
>>
>> and I would like to get it to a point where the characters are
>> represented properly, i.e.
>>
>> output.xml
>> <?xml version="1.0" encoding="UTF-8"?>
>> <documents>
>> <document>The reality of the effect of natural ventilation in a
>> residential attic cavity has been the topic of many debates and
>> scholarly reports since the 1930’s.</document>
>> </documents>
>>
>> Thanks to Liam's help on irc and reading through the list archives,
>> it seems like an identity transform should be the right step towards
>> getting the representation corrected, but something isn't working
>> (or I have a misunderstanding somewhere).
>>
>> If I apply the following identity transform with Saxon HE 9.6.0.7 in
>> oXygen 18:
>> <?xml version="1.0" encoding="UTF-8"?>
>> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>> version="2.0">
>> <xsl:output encoding="UTF-8" indent="yes"/>
>> <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
>> </xsl:stylesheet>
>>
>> I get the following result:
>> <?xml version="1.0" encoding="UTF-8"?>
>> <documents>
>> <document>The reality of the effect of natural ventilation in a
>> residential attic cavity has been the topic of many debates and
>> scholarly reports since the 1930â€™s.</document>
>> </documents>
>>
>> Could someone provide some insight into what I've done wrong here?
>> Any help would be greatly appreciated.
>>
>> Best,
>> Bridger
>>
>> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
>> EasyUnsubscribe <-list/528976> (by email)
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
> ------------------------------------------------------------------------------
> Meet us at Frankfurt Book Fair:
> Hall 4.2, Stand L68.
> More info at http://www.le-tex.de/en/buchmesse.html