Subject: RE: Generating numeric character references
From: "Andrew Welch" <AWelch@xxxxxxxxxxxxxxx>
Date: Thu, 16 Jan 2003 09:44:37 -0000
I think the original poster had a problem of double escaping, such as
&amp;#173;
in their source, and they simply wanted to convert this to the correct &#173;.
Wouldn't running the source XML through an identity transform give the desired result, with no need for string processing of any kind?
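For reference, the usual XSLT 1.0 identity transform looks like the sketch below. It assumes nothing beyond the standard copy-everything template and is not specific to MSXML or .NET:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

One thing to check on a sample first: a text node that holds a literal ampersand is normally written back out as &amp;, so the doubled escaping may well survive the round trip.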
cheers
andrew
> -----Original Message-----
> From: Wendell Piez [mailto:wapiez@xxxxxxxxxxxxxxxx]
> Sent: 14 January 2003 21:55
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Generating numeric character references
>
>
> Stuart,
>
> The reason your task is proving difficult is that it's really not what
> it appears to be at first blush. There is a trap here, which you can
> recognize if you can clearly distinguish between XML-as-serialization
> format, and the XML document (a tree of nodes as described in the XPath
> spec) that an XSLT processor operates on.
>
> Numeric character references may appear in XML-as-serialization; in the
> XPath tree (the "document" built by the parser and handed to the XSLT
> engine), however, these references never appear as such; rather, each
> has been converted into the character it represents.
>
> So, for example, if your data has character reference &#x41;, your XSLT
> processor sees this as an "A". (It may come out the back as "&#x41;" if
> your serialization encoding happens not to be able to do a proper "A",
> but internally it's an "A"). Therefore, what's required with
> "&amp;#x41;" isn't to turn it into "&#x41;", but rather into "A". (Or,
> if you get my drift: you need to convert "&amp;#x41;" into "&#x41;"
> *before* your document is parsed, or an "&#x41;" into an "A" *after*
> your document is parsed.)
>
> You are currently trying to do the latter; and it can be done -- as
> you're discovering -- with recursive processing over text nodes,
> heuristics to recognize target substrings, and a table to map them. But
> it's not a job that XSLT lends itself to, since XSLT is as ungainly for
> processing strings as it is slick for processing nodes. Far preferable
> would be to use Perl or something else with good support for
> string-handling and regular expressions, to do the former task (munge
> the &amp; entities before parsing).
>
> Yet -- and this is where one has to be *very* cautious -- XSLT does, at
> least in certain circumstances (i.e. with certain processors in certain
> operational contexts) give you *some* control over how a document, once
> processed, is serialized -- and *if your data is clean* this optional
> feature can be drafted into service to help with your problem. What I'm
> getting to, of course, is the dreaded disable-output-escaping....
>
> That is, if your data is otherwise unproblematic, you can achieve your
> goal by running your document through a near-identity transform that
> disables output escaping on your text nodes. The document will emerge
> from the transform unchanged (at least as XPath sees it) but with
> "&amp;#x41;" represented as "&#x41;". This, *when parsed again*, will be
> seen as the "A" you really want.
>
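A minimal sketch of the near-identity transform described in the paragraph above, assuming XSLT 1.0 and a processor whose serializer honours disable-output-escaping. That attribute is an optional feature, and it has no effect when the result is passed on as a tree instead of being serialized. The explicit priority on the text() template is there because text() and node() otherwise have the same default priority.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Identity template: copy elements, attributes, comments and
       processing instructions as they are. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Override for text nodes only: write the string value without
       escaping, so a text node holding the six characters &#173; is
       serialized as &#173; rather than as &amp;#173;. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```

Attribute values are not touched by this override, and any text node that contains a literal < or a bare & will come out ill-formed, which is the risk the next paragraph spells out.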
> Note that this is not (if we're really strict with our terms) a
> transformation in the XSLT sense. Rather, it's a tricky application of
> the serializer attached to most processors; it will sometimes break
> because it disables escaping on the wrong characters (so if you have
> any data such as "if x < y", you're going to be in trouble unless you
> write string-processing code to catch and work around it), and it uses
> an optional feature of the language that restricts portability.
>
> Please consider this only a golden-hammer solution (i.e. lacking a
> better tool to do the job), and keep in mind it's easy to bang your
> thumb this way (since any anomalies in the input will make your output
> not well-formed). It is in view of these limitations that this really
> should be done in a separate pass, if with XSLT at all.
>
> Cheers,
> Wendell
>
> At 03:05 PM 1/14/2003, you wrote:
> >I'd like to transform specific text substrings into numeric character
> >references during an XSLT transformation. For example, I want to
> >transform all occurrences that look like "&amp;#173;" within a string
> >into "&#173;".
> >
> >Here's the back story. I have source XML that is generated automatically
> >from HTML by a third party. The third party incorrectly handles entity
> >references, so that "&#173;" in the original HTML becomes "&amp;#173;"
> >in the XML. I want to undo the damage done. To simplify things, I am
> >only interested in documents with ISO 8859-1 encoding.
> >
> >Below is a solution [1] that I am not pleased with. It is a named
> >template that recursively parses a string, trying to replace references.
> >This requires an <xsl:when> element for each value of numeric character
> >reference that might be encountered (see the "additional cases here"
> >comment). Problems with this include linear search of values, omitted
> >values, and opportunity for error in mismatched values.
> >
> >Can anyone suggest a better approach to generating numeric character
> >references? I would be fine restricting the solution to MSXML or
> >.NET's System.Xml.Xsl XSLT processors, if that is an issue.
> >
> >Many thanks!
> >
> >Cheers,
> >Stuart
> >
> >
> >
> >[1] A less than happy solution:
> >
> >  <xsl:template name="restoreNumCharRefs">
> >    <xsl:param name="string"/>
> >
> >    <xsl:choose>
> >      <xsl:when test="contains($string, '&amp;')">
> >        <xsl:variable name="head" select="substring-before($string, '&amp;')"/>
> >        <xsl:variable name="remainder" select="substring-after($string, '&amp;')"/>
> >        <xsl:variable name="reference" select="substring-before($remainder, ';')"/>
> >
> >        <xsl:variable name="entity">
> >          <xsl:choose>
> >            <xsl:when test="$reference='#167'">&#167;</xsl:when>
> >            <xsl:when test="$reference='#173'">&#173;</xsl:when>
> >
> >            <!-- additional cases here -->
> >
> >            <xsl:otherwise>&amp;<xsl:value-of select="$reference"/>;</xsl:otherwise>
> >          </xsl:choose>
> >        </xsl:variable>
> >
> >        <xsl:variable name="tail">
> >          <xsl:call-template name="restoreNumCharRefs">
> >            <xsl:with-param name="string" select="substring-after($remainder, ';')"/>
> >          </xsl:call-template>
> >        </xsl:variable>
> >
> >        <xsl:value-of select="concat($head, $entity, $tail)"/>
> >      </xsl:when>
> >      <xsl:otherwise>
> >        <xsl:value-of select="$string"/>
> >      </xsl:otherwise>
> >    </xsl:choose>
> >
> >  </xsl:template>
> >
> >
> > XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================