  • From: Hermann Stamm-Wilbrandt <STAMMW@d...>
  • To: "Costello, Roger L." <costello@m...>
  • Date: Sat, 29 Dec 2012 03:13:21 +0100

Roger,

running the modified file through an identity transform produces the
error you were looking for, see below. The reason is that "70" is not a
valid second byte in a UTF-8 multi-byte sequence; continuation bytes
have the form "10xxxxxx".
http://en.wikipedia.org/wiki/Utf-8#Description
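
To make the "10xxxxxx" rule concrete, here is a minimal Python sketch (the
helper name is_continuation_byte is made up for illustration). It tests the
top two bits of a byte, which is all the rule requires.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if the byte has the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80

# 0x70 ('p') follows the lead byte 0xF3 in the modified file;
# 0xA4 is shown for comparison as a byte that does fit the pattern.
for b in (0x70, 0xA4):
    print(f"{b:02X} = {b:08b} -> continuation byte: {is_continuation_byte(b)}")
```

0x70 is 01110000 in binary, so it fails the test, while 0xA4 (10100100)
passes it.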

But there is no guarantee that the failure happens.
Take for example the two-character sequence "Ã¤": it is "C3 A4" when
encoded in ISO-8859-1. If you now do your "utf-8" encoding
modification experiment, these two bytes will be interpreted as the
valid UTF-8 two-byte encoding of the "ä" character.
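
A short Python sketch of this case, using the characters U+00C3 (Ã) and
U+00A4 (¤) as the two-character sequence. The variable names are only
illustrative. It shows that the ISO-8859-1 bytes also decode cleanly as
UTF-8, so no parser error can be expected.

```python
# U+00C3 (A with tilde) followed by U+00A4 (currency sign), written with
# escapes so the example does not depend on the source file's own encoding.
two_chars = "\u00c3\u00a4"

raw = two_chars.encode("iso-8859-1")   # one byte per character
print(raw.hex(" "))                    # c3 a4

# The same two bytes form a well-formed UTF-8 sequence for U+00E4.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), f"U+{ord(as_utf8):04X}")   # 1 U+00E4
```

The decode succeeds silently, which is why a mislabeled document of this
kind goes undetected.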


$ od -Ax -tcx1 Lopez.modified.xml
000000   <   ?   x   m   l       v   e   r   s   i   o   n   =   "   1
        3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
000010   .   0   "       e   n   c   o   d   i   n   g   =   "   u   t
        2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  75  74
000020   f   -   8   "                       ?   >  \n   <   N   a   m
        66  2d  38  22  20  20  20  20  20  3f  3e  0a  3c  4e  61  6d
000030   e   >   L 363   p   e   z   <   /   N   a   m   e   >  \n
        65  3e  4c  f3  70  65  7a  3c  2f  4e  61  6d  65  3e  0a
00003f
$


$ xsltproc identity.xsl Lopez.modified.xml
Lopez.modified.xml:2: parser error : Input is not proper UTF-8, indicate
encoding !
Bytes: 0xF3 0x70 0x65 0x7A
<Name>L�pez</Name>
       ^
unable to parse Lopez.modified.xml
$
$ saxon-6.5.5 Lopez.modified.xml identity.xsl
Error at byte 10 of file:/home/stammw/Lopez/Lopez.modified.xml:
  Error reported by XML parser: bad continuation of multi-byte UTF-8
sequence (code: 0x70)
Transformation failed: Run-time errors were reported
$
$ xalan identity.xsl -IN Lopez.modified.xml

(Location of error unknown)XSLT Error
(javax.xml.transform.TransformerException):
com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character
(Unicode: 0xffffffff) was found in the element content of the document.
Exception in thread "main" java.lang.RuntimeException:
com.ibm.xtq.common.utils.WrappedRuntimeException: An invalid XML character
(Unicode: 0xffffffff) was found in the element content of the document.
	at org.apache.xalan.xslt.Process.doExit(Unknown Source)
	at org.apache.xalan.xslt.Process.main(Unknown Source)
$
$ cat identity.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml"/>

  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
$
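
The same check can be sketched without an XSLT processor. The following
Python example is an illustration under assumptions: it uses the
expat-based parser from Python's standard library (not one of the
processors above), and the function name parse_with_declared is made up.
It feeds identical ISO-8859-1 bytes to the parser under two different
encoding declarations.

```python
import xml.etree.ElementTree as ET

# The element content as ISO-8859-1 bytes: the accented o is the single byte F3.
BODY = "<Name>L\u00f3pez</Name>".encode("iso-8859-1")

def parse_with_declared(encoding: str) -> str:
    """Parse the same bytes under the given encoding declaration."""
    prolog = f'<?xml version="1.0" encoding="{encoding}"?>\n'.encode("ascii")
    try:
        return ET.fromstring(prolog + BODY).text
    except ET.ParseError as err:
        return f"not well-formed: {err}"

print(parse_with_declared("iso-8859-1"))
print(parse_with_declared("utf-8"))
```

With the matching declaration the name is returned; with "utf-8" the
parser reports a well-formedness error, because F3 is not followed by
continuation bytes.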


Best wishes,

Hermann Stamm-Wilbrandt
Level 3 support for XML Compiler team and Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
https://twitter.com/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294


  • From: "Costello, Roger L." <costello@m...>
  • To: "xml-dev@l..." <xml-dev@l...>
  • Date: 12/28/2012 09:39 PM
  • Subject: An XML document is not well-formed if encoding="..." does not
    match the actual encoding of the characters in the document, right?

Thanks Chris for pointing us to that article: XML on the Web has Failed

I am making my way through it.

This statement in the article piqued my interest:

    ... determining the actual character encoding of an
    XML document is a prerequisite for determining its
    well-formedness ...

I decided to do an experiment.

I created this XML document, encoded every character in it using
iso-8859-1, and declared iso-8859-1 in the encoding="..." attribute of
the XML declaration:

<?xml version="1.0" encoding="iso-8859-1"?>
<Name>López</Name>

I checked the document for well-formedness and the XML parser said it is
well-formed.

Good.

Then I changed encoding="iso-8859-1" to encoding="utf-8":

<?xml version="1.0" encoding="utf-8"?>
<Name>López</Name>

I checked it for well-formedness and the parser said it is still
well-formed.

Huh?

Shouldn't I have gotten a well-formedness error?

I did my experiment using the latest version of Oxygen XML. I think that it
uses the Xerces XML Parser, right?

Is this a bug in Xerces?

/Roger


