Subject: RE: XSLT 2.0 : Unicode hex notation in regular expressions
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Mon, 12 Jun 2006 21:28:30 +0100
The CJKCompatibility block covers the codepoint range x3300-x33FF only. I
would imagine that to match Japanese language characters you are looking for
a much larger range than this.
If the range of codepoints you want to match doesn't correspond to one of
the named blocks, you can always write the range explicitly as a character
class, for example [&_#x3000;-&_#xFE4F;] (without the underscores).
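
As a rough sketch of how such a range might be dropped into the stylesheet
quoted below: the range here is only illustrative, and it is much wider than
Japanese alone (it also takes in Chinese ideographs, Hangul and the private
use area), so it cannot tell Japanese from Chinese text. The <words> wrapper
element and the xsl:value-of are assumptions added so the output is
well-formed and shows each token; they are not in the original stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/text">
    <!-- Hypothetical wrapper element so the result has a single root -->
    <words>
      <xsl:for-each select="tokenize(., '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- Explicit codepoint range written with character references.
                   matches() is unanchored, so one character in the range
                   anywhere in the token is enough to select this branch. -->
              <xsl:when
                  test="matches(., '[&#x3000;-&#xFE4F;]')">Japanese</xsl:when>
              <xsl:when
                  test="matches(., '\p{IsBasicLatin}')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <!-- Echo the token itself (an addition for illustration) -->
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

If a narrower test is wanted, the named blocks \p{IsHiragana} and
\p{IsKatakana} are specific to Japanese, and \p{IsCJKUnifiedIdeographs}
covers the kanji (shared with Chinese); they can be combined in one
character class.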
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: jbesch@xxxxxxx [mailto:jbesch@xxxxxxx]
> Sent: 12 June 2006 20:26
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Cc: jbesch@xxxxxxx
> Subject: Re: XSLT 2.0 : Unicode hex notation in regular
> expressions
>
> > How, for example, to use a useful syntax like
> > matches(.,'\p{Script:Arabic}+') ?
> >
> >schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
> >
> >[Definition:] [Unicode Database] groups code points into a number of
> >blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
> >Jamo, CJK Compatibility, etc. The set containing all characters that
> >have block name X (with all white space stripped out), can be
> >identified with a block escape \p{IsX}. The complement of this set is
> >specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
> >...
> >For example,
> >the "block escape" for identifying the ASCII characters is \p{IsBasicLatin}.
> >
> >so that would be \p{IsArabic}
> >
> >David
>
>
>
> I want to use the above construct to detect Japanese
> characters, and so I am using the following xsl:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="2.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
> <xsl:output method="xml" indent="yes" encoding="UTF-8" />
> <xsl:template match="/text">
> <xsl:for-each select="tokenize(.,'\s+')">
> <word>
> <xsl:attribute name="language">
> <xsl:choose>
> <xsl:when
> test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
> <xsl:when
> test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
> <xsl:otherwise>Unknown</xsl:otherwise>
> </xsl:choose>
> </xsl:attribute>
> </word>
> </xsl:for-each>
> </xsl:template>
> </xsl:stylesheet>
>
> However, the Japanese characters in my input, which are
> encoded in UTF-8, come out flagged as Latin or Unknown. What
> am I doing wrong? How do I get this to recognize the
> Japanese characters?
>
> Thanks for any help you can offer.
>
> John Besch