Subject: Re: [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 16 Dec 2004 18:14:56 -0500
Thanks, good find. The only problem now is that this issue needs to be
addressed in java.util.regex.
Colin Paul Adams wrote:
>>>>>>"Gunther" == Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx> writes:
>
>
> Gunther> The boundary matcher matches a zero-width substring
> Gunther> between a character matching the character class
> Gunther> [A-Za-z_0-9] and a character matching the character class
> Gunther> [^A-Za-z_0-9] or vice versa. </quote>
>
> Gunther> This is pretty clear. It may not make the
> Gunther> internationalization people very happy because I can't do
> Gunther> word-boundary matches on Hindi text. That's a true
> Gunther> concern.
>
> So address it. Unicode report TR18 says (for Level 1 support):
>
> RL1.4 Simple Word Boundaries
> To meet this requirement, an implementation shall extend the word boundary mechanism so that:
>
> 1. The class of <word_character> includes all the Alphabetic values from the Unicode character database, from UnicodeData.txt [UData]. See also Annex C: Compatibility Properties.
>
> 2. Non-spacing marks are never divided from their base characters, and otherwise ignored in locating boundaries.
>
> Level 2 provides more general support for word boundaries between
> arbitrary Unicode characters which may override this behavior.
>
> Level 1 support should certainly be met.
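
To make the difference concrete, here is a rough Java sketch that spells out
both boundary definitions with explicit lookarounds instead of relying on \b.
The class and method names (WordBoundarySketch, countBoundaries) are made up
for illustration. The "Unicode" word class uses \p{L}, \p{M} and \p{Nd} as an
approximation of the TR18 Alphabetic property plus combining marks; it is not
an exact RL1.4 implementation. The sketch also treats the start and end of the
input as boundaries when a word character sits there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class WordBoundarySketch {
    // The ASCII word class from the quoted definition.
    static final String ASCII_WORD = "[A-Za-z_0-9]";
    // Assumption: letters, marks and decimal digits approximate the
    // TR18 RL1.4 word class; marks are included so that they are never
    // split from their base characters.
    static final String UNICODE_WORD = "[\\p{L}\\p{M}\\p{Nd}_]";

    // Counts zero-width positions where a word character meets a
    // non-word character (or an edge of the input).
    static int countBoundaries(String wordClass, String text) {
        String boundary =
            "(?<=" + wordClass + ")(?!" + wordClass + ")"     // word -> non-word
            + "|(?<!" + wordClass + ")(?=" + wordClass + ")"; // non-word -> word
        Matcher m = Pattern.compile(boundary).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two Hindi words separated by a space, written with Unicode escapes.
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947 "
                     + "\u0926\u0941\u0928\u093F\u092F\u093E";
        System.out.println(countBoundaries(ASCII_WORD, "foo bar"));
        System.out.println(countBoundaries(ASCII_WORD, hindi));
        System.out.println(countBoundaries(UNICODE_WORD, hindi));
    }
}
```

With the ASCII class the Hindi text contains no word characters at all, so no
boundary is ever found; with the wider class the two words are delimited the
same way "foo bar" is.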
--
Gunther Schadow, M.D., Ph.D. gschadow@xxxxxxxxxxxxxxx
Associate Professor Indiana University School of Informatics
Regenstrief Institute, Inc. Indiana University School of Medicine
tel:1(317)630-7960 http://aurora.regenstrief.org