Subject: Re: Tokenization - Thai language
From: "Tony Graham" <tgraham@xxxxxxxxxx>
Date: Wed, 15 Jun 2011 12:22:00 +0100 (IST)
On Wed, June 15, 2011 11:24 am, Jan Pour wrote:
> I would like to tokenize Thai text on all places, where it can be
> broken to new line.
> How could I do it in XSLT? Using extensions in java??
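My first thought would be to build an extension based on the International
Components for Unicode (ICU) [1], which is available for Java as ICU4J.
See, e.g., the documentation on boundary analysis [2], which covers
line-break boundaries as well as word boundaries.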
You wouldn't get very far tokenizing with regular expressions based on
'\w' or '\W' since, as you probably know, Thai ordinarily doesn't put
separators between words.
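As a rough sketch of the boundary-analysis approach, the Java below splits
a string at every line-break opportunity. It uses the JDK's own
java.text.BreakIterator so that it runs without extra jars; that is a
substitution on my part, not the ICU extension itself. ICU4J's
com.ibm.icu.text.BreakIterator has the same-named methods, so swapping the
import should be close to all that is needed, but treat that as an
assumption to check. The class name LineBreakTokenizer and the method
tokenize are made up for illustration, and how well Thai is segmented
depends on the locale data your Java runtime ships with.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU or of any XSLT processor.
final class LineBreakTokenizer {
    private LineBreakTokenizer() {}

    /**
     * Splits text into segments that end at each position where the
     * line-break iterator for the given locale allows a break.
     * Concatenating the segments gives back the original text.
     */
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator breaks = BreakIterator.getLineInstance(locale);
        breaks.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = breaks.first();
        // next() returns BreakIterator.DONE once the end has been passed.
        for (int end = breaks.next(); end != BreakIterator.DONE; end = breaks.next()) {
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample Thai text with no spaces between words.
        String thai = "ภาษาไทยไม่มีช่องว่างระหว่างคำ";
        for (String token : tokenize(thai, Locale.forLanguageTag("th"))) {
            System.out.println(token);
        }
    }
}
```

Exposing tokenize to a stylesheet as an extension function is specific to
the XSLT processor you use, so I have left that part out.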
Regards,
Tony Graham tgraham@xxxxxxxxxx
Consultant http://www.mentea.net
Mentea 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
XML, XSL FO and XSLT consulting, training and programming
[1] http://site.icu-project.org/
[2] http://userguide.icu-project.org/boundaryanalysis