Subject: Re: marking up text when term from other file is found
From: Mukul Gandhi <gandhi.mukul@xxxxxxxxx>
Date: Thu, 22 Apr 2010 11:51:10 +0530
I would try to solve this as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <!-- identity template: copy everything not handled below -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

  <!-- wrap each occurrence of an index term in a text node with <ph> -->
  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="."
        regex="{string-join(for $term in $index-terms/terms/term
                return concat('(', $term, ')'), '|')}">
      <xsl:matching-substring>
        <!-- id = the matched term's index* attribute values joined by '_' -->
        <xsl:variable name="idVal"
            select="string-join(for $attrVal in
                    $index-terms/terms/term[. = regex-group(0)]/@*[starts-with(name(), 'index')]
                    return $attrVal, '_')" />
        <ph id="{$idVal}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
You may adapt this to suit your requirements.
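
The stylesheet above compares terms case-sensitively, while the question asks about differing capitalization. A minimal sketch of one possible adaptation follows. It assumes the same indexTerms.xml layout, passes flags="i" to xsl:analyze-string, and looks the term up with lower-case(). The variable name term-regex, the longest-first sort, the regex escaping, and the rule that text already inside a ph element is left alone are all assumptions added for illustration, not part of the original stylesheet. It does not handle inflected forms (for example "Curing fever" against the term "Cure fever"), and it does not check word boundaries.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <!-- Build the alternation once. Longest terms come first so that
       "Children, illness" is tried before "Children"; regex
       metacharacters in the terms are backslash-escaped. -->
  <xsl:variable name="term-regex" as="xs:string">
    <xsl:variable name="escaped" as="xs:string*">
      <xsl:for-each select="$index-terms/terms/term">
        <xsl:sort select="string-length(normalize-space(.))"
                  data-type="number" order="descending" />
        <xsl:sequence select="replace(normalize-space(.),
            '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')" />
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="string-join($escaped, '|')" />
  </xsl:variable>

  <!-- identity template -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

  <!-- assumption: text already inside a ph element is not wrapped again -->
  <xsl:template match="ph//text()" priority="20">
    <xsl:copy />
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- flags="i" makes the match case-insensitive -->
    <xsl:analyze-string select="." regex="{$term-regex}" flags="i">
      <xsl:matching-substring>
        <xsl:variable name="hit" select="lower-case(.)" />
        <!-- first term whose text equals the match, ignoring case -->
        <xsl:variable name="term"
            select="($index-terms/terms/term
                     [lower-case(normalize-space(.)) = $hit])[1]" />
        <!-- note: the order of @* is implementation-dependent -->
        <ph id="{string-join($term/@*[starts-with(name(), 'index')], '_')}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Building the regex in a global variable means it is computed once rather than for every text node, which may matter with about 150 terms and a large narrative file.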
On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
<hoskgret@xxxxxxxxxxxxxxxx> wrote:
>
> Hi, I need help finding resources (examples and/or XSL) for this situation,
> for which I haven't found quite the right recipe in my searches of the list
> archives.
> Given an XML file containing a list of terms and another file containing a
> mix of elements containing text (narrative content, some inline markup for
> emphasis and footnotes), I was asked if I could find occurrences of each
> term wherever it appeared in the narrative content, and wrap each occurrence
> with a tag. So my first thought is to load up each document into a variable.
> But then I don't know what the most effective method of string comparison
> would be, given that the narrative document might have the term's words with
> different capitalization. If anyone can point me in the right direction, I'd
> appreciate it. Also I would like to know if there is a practical limit to
> how large a narrative file I can run with about 150 terms to find in the
> text. And if a different approach would work better, such as writing Java
> to do brute force search and replace, please tell me so. (I work with a
> Java programmer. Everything looks like a Java problem to her and an XSL
> problem to me.)
> -- Dorothy
> Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
> sentence as an example.
> Example of terms (indexTerms.xml):
> <?xml version="1.0" encoding="UTF-8"?>
> <terms>
>   <term index1="anxiety">Anxiety</term>
>   <term index1="children">Children</term>
>   <term index1="children" index2="illness">Children, illness</term>
>   <term index1="children" index2="nightmare">Children, nightmare</term>
>   <term index1="cure" index2="fever">Cure fever</term>
>   <term index1="cure" index2="illness">Cure illness</term>
>   <term index1="anxiety" index2="nightmare">Nightmare</term>
>   <term index1="children" index2="illness">Sick children</term>
>   <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
> </terms>
>
> Example of narrative (sampleTopic.xml):
> <?xml version='1.0' encoding='UTF-8'?>
> <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> <topic id="sampleTopic">
>   <title>sampleTopic</title>
>   <body>
>     <p>markup for sample terms testing a set of phrases to match to the
> content of index terms:</p>
>     <p>Texttexttext text some of the terms are already in <ph> i.e. <ph
> id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> children</ph> and sometime the same terms occur, <i>but different case</i>,
> not in a ph: Curing fever and <b>Sick children</b>. I need to get all the
> occurrences of each of the term element strings marked up with <ph>
> </p>
>   </body>
> </topic>
>
> Desired result:
> <?xml version='1.0' encoding='UTF-8'?>
> <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> <topic id="sampleTopic">
>   <title>sampleTopic</title>
>   <body>
>     <p>markup for sample terms testing a set of phrases to match to the
> content of index terms:</p>
>     <p>Texttexttext text some of the terms are already in <ph> i.e. <ph
> id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> children</ph> and sometime the same terms occur, <i>but different case</i>,
> not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
> id="children_illness">Sick children</ph></b>. I need to get all the
> occurrences of each of the term element strings marked up with <ph>
> </p>
>   </body>
> </topic>
>
> XSL:
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> version="2.0">
> <xsl:param name="indexFile">indexTerms.xml</xsl:param>
> <xsl:param name="textFile">sampleTopic.xml</xsl:param>
> <xsl:variable name="termsDocument"
> select="document($indexFile)"></xsl:variable>
> <xsl:variable name="textDocument"
> select="document($textFile)"></xsl:variable>
> <xsl:template match="*" name="test1"><xsl:result-document
> href="matchText-test.xml" method="xml">
> <!-- proof that I can get the terms -->
> <xsl:text> </xsl:text><xsl:comment><xsl:text>first term is
> </xsl:text><xsl:value-of
> select="$termsDocument/terms/term[1]"/></xsl:comment>
> <xsl:text> </xsl:text><xsl:comment><xsl:text>second term is
> </xsl:text><xsl:value-of
> select="$termsDocument/terms/term[2]"/></xsl:comment>
> <xsl:text> </xsl:text><xsl:comment><xsl:text>third term is
> </xsl:text><xsl:value-of
> select="$termsDocument/terms/term[3]"/></xsl:comment>
> <!-- now how do I find them in the $textDocument elements and mark them up?
> -->
> </xsl:result-document>
> </xsl:template>
> </xsl:stylesheet>
--
Regards,
Mukul Gandhi