I think there are two basic approaches to this kind of problem. One is to
convert the punctuation into tags, and then manipulate the resulting tree
structure; the other is to turn the embedded tags into punctuation (like
"[emphasis]two[/emphasis]") and then manipulate the content as a character
string. My instinct, like Martin Honnen's, is to do the first.

There are still complications, of course. For example, if you're detecting
end-of-sentence as [.?!] followed by a space or end-of-paragraph, then it's
challenging to handle the case where the [.?!] is the last character in a text
node but the text node isn't the last thing in the paragraph (for example,
"sentence.<footnote>x</footnote> "). There's no easy answer to this (and
natural language being what it is, there is no right answer either).
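
As a rough illustration of the first approach, here is a minimal XSLT 2.0
sketch, not a definitive solution. It assumes that sentence boundaries only
occur in text nodes that are direct children of <p>. The <eos/> marker
element, the "mark" mode and the <sentence> wrapper are names made up for this
example. The first pass copies the paragraph content and adds an <eos/> after
each [.?!] that is followed by whitespace or by the end of the text node; the
second pass groups the children on those markers, so inline elements such as
<emphasis> are kept.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output indent="no"/>
  <xsl:strip-space elements="root"/>

  <!-- Identity template, used in every mode for nodes not handled below. -->
  <xsl:template match="@*|node()" mode="#all">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1: convert punctuation into markup. In text nodes that are
       direct children of p, emit a hypothetical <eos/> marker after each
       [.?!] that is followed by whitespace or ends the text node. -->
  <xsl:template match="p/text()" mode="mark">
    <xsl:analyze-string select="." regex="[.?!](\s+|$)">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: group the marked-up children into sentences. -->
  <xsl:template match="p">
    <!-- Temporary tree holding the paragraph content plus <eos/> markers. -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>
    <!-- Temporary tree holding one <sentence> per group of nodes. -->
    <xsl:variable name="sentences">
      <xsl:for-each-group select="$marked/node()" group-ending-with="eos">
        <sentence>
          <xsl:copy-of select="current-group()[not(self::eos)]"/>
        </sentence>
      </xsl:for-each-group>
    </xsl:variable>
    <p><xsl:copy-of select="$sentences/sentence[1]/node()"/></p>
    <note>Something in between.</note>
    <p><xsl:copy-of select="$sentences/sentence[position() gt 1]/node()"/></p>
  </xsl:template>

</xsl:stylesheet>
```

Because the end of a text node is treated like the end of the paragraph, this
sketch would split "sentence.<footnote>x</footnote> " before the footnote. A
sentence break inside an inline element would not be detected at all. Both are
cases where you have to decide on a policy.
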
Michael Kay
Saxonica
> On 24 Nov 2019, at 13:34, Rick Quatro rick@xxxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi All,
>
> I have a situation where I want to split a short paragraph into sentences
> and use them in different parts of my output. I am using <xsl:analyze-string>
> because I want to account for a sentence ending with a . or ?. This will work
> except if there are any children of the paragraph, like the <emphasis> in the
> second sentence. Can I split a paragraph into sentences and still keep the
> markup?
>
> Here is my input document:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
> </root>
>
> My stylesheet:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> xmlns:rq="http://www.frameexpert.com"
> exclude-result-prefixes="xs rq"
> version="2.0">
>
> <xsl:output indent="yes"/>
> <xsl:strip-space elements="root"/>
>
> <xsl:template match="/root">
> <xsl:copy>
> <xsl:apply-templates/>
> </xsl:copy>
> </xsl:template>
>
> <xsl:template match="p">
> <xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>
> <p><xsl:value-of select="$sentences[1]"/></p>
> <note>Something in between.</note>
> <p><xsl:value-of select="$sentences[position()>1]"/></p>
> </xsl:template>
>
> <xsl:function name="rq:splitParagraphIntoSentences">
> <xsl:param name="paragraph"/>
> <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
> <xsl:matching-substring>
> <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
> </xsl:matching-substring>
> </xsl:analyze-string>
> </xsl:function>
> </xsl:stylesheet>
>
> My output:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <p>This has one sentence?</p>
> <note>Something in between.</note>
> <p>Actually, it has two. No, it has three.</p>
> </root>
>
> What I want is this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <p>This has one sentence? </p>
> <note>Something in between.</note>
> <p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
> </root>
>
> Any suggestions will be appreciated.
>
> Rick