Re: [xsl] Using XSLT to build an index


Subject: Re: Using XSLT to build an index
From: "Mark" <mark@xxxxxxxxxxxx>
Date: Sun, 30 Oct 2011 16:11:22 -0700

Hi Ken, In your case, you undoubtedly had context phrases in the text that required special treatment. Unlike indexing a full text as you were attempting, I am indexing a much smaller universe: the words that appear on Czech stamps and their translations into English. Also, I am not going to pay attention to case, number, or any other part of speech [I don't have to unite references to the same word in different forms, nor collect specific ideas]. I am constructing a simple word list - indexing may be too strong a term for the end product.

I was a professional indexer years ago, and was head of research and development for a firm that wrote software for public, university, and school libraries, so I have a very sound command of those parts of indexing and information theory: it's my XPath and XSLT that are so very, very weak :-)

When I abstracted out the words from my XML file for indexing, I also collected (constructed, actually) the links and link names, so that part is all taken care of.

Let me absorb what you have given me and see where it takes me. Thanks for showing me how to deal with the tokenize() output. I'll let you know how it turns out.

Regards,
Mark

-----Original Message-----
From: G. Ken Holman
Sent: Sunday, October 30, 2011 3:07 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Using XSLT to build an index

At 2011-10-30 14:47 -0700, Mark wrote:

The list archives did not seem to contain an XSLT stylesheet that could index an XML file, but I may have missed it. Is it practical to write my own XSLT 2 indexing stylesheet? If so, I have a bilingual XML file that I want to index.


If you simply want all words, except your stop words, collected to automate the index generation, be aware that I've never been successful with automated indexing myself.  For my books I've authored the components of the index, and then pointed to those components from within the code.

My assumptions are that I must get rid of the punctuation properly, then isolate the words, sort them, remove stop words, and so on. To get started, I need a bit of help. All of the phrases are found in two attributes: @czech and @eng.

Three questions: (1) I am aware from Michael's book that regular expressions may be used in the replace() function, but I do not know how to write that regular expression. I would like to remove all the punctuation from a phrase as follows: everything except a hyphen [-] should be replaced with an empty string; the hyphen should be replaced with a single space.


Simple character removal can be done with
translate() in XSLT 1 or 2 rather than using a regular expression:

translate($inValue,'-,#.$%',' ')

... where the first argument is your input and the second starts with a "-" followed by any other characters you want removed. The third argument is a single space, so the hyphen becomes a space; the remaining characters have no counterpart in the third argument and are therefore removed.
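
As a minimal sketch of the translate() call above, the stylesheet below applies it to a made-up sample phrase. The sample string is an assumption chosen purely for illustration; only the translate() arguments come from the message.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical sample phrase; real input would be @czech or @eng -->
    <xsl:variable name="inValue" select="'first-day cover, 1968.'"/>

    <!-- '-' maps to ' '; ',' '#' '.' '$' '%' have no counterpart and are dropped -->
    <!-- Result: first day cover 1968 -->
    <xsl:value-of select="translate($inValue, '-,#.$%', ' ')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because translate() works character by character, any punctuation mark not listed in the second argument passes through unchanged, so the list has to be extended to cover whatever actually occurs in the data.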

(2) I assume that to get rid of extra spaces (if any), I can use a construct like: normalize-space(replace(@czech, 'some regex expression')).

That will reduce all sequences of white-space characters to a single space, and it also strips leading and trailing white-space.

(3) I assume that tokenize(normalize-space(replace(@czech, 'some regex expression'))) will permit me to write out a list of the words found in those attributes to an XML document. I am not completely clear as to what tokenize() returns, or how to access that return.

tokenize() returns a sequence. But the input is only a single string.
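
One way to access that sequence is to iterate over it with xsl:for-each, where each token becomes the context item in turn. The following is only a sketch: the phrase and word result elements are hypothetical names, and it assumes the phrases live in @czech attributes as described above. Note that in XPath 2.0, tokenize() takes two arguments, the input string and a regular expression that matches the separators.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <words>
      <!-- Visit every @czech attribute in the document -->
      <xsl:for-each select="//@czech">
        <phrase>
          <!-- tokenize() yields a sequence of strings; "." is each string in turn -->
          <xsl:for-each select="tokenize(normalize-space(.), ' ')">
            <word><xsl:value-of select="."/></word>
          </xsl:for-each>
        </phrase>
      </xsl:for-each>
    </words>
  </xsl:template>

</xsl:stylesheet>
```

Calling normalize-space() before tokenizing on a single space avoids empty tokens from leading, trailing, or repeated spaces.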

Actually, if you want to turn the expression inside-out to get a list of words from the entire document, then something along these lines should work:

distinct-values(
(//@czech)/tokenize(normalize-space(translate(.,'-,$%.#',' ')),' ')  )

That gives you a sequence of unique words.  Can
you work from that in order to do the
hyperlinking, or do you need help there as
well?  Remember you will have to do the same
translation when creating your links, so perhaps
you should have a user function:

mark:words($arg) as tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')

... then use:

(//@czech)/mark:words(.)

... then when creating your links you'll have the function available to ensure the same tokenizing is done at that point.
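
The user function suggested above could be declared with xsl:function roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the namespace URI bound to the mark prefix, the stop-word list, and the words/word result elements are all hypothetical. Here translate() is applied before normalize-space() so that spaces introduced by replaced hyphens are collapsed before tokenizing.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mark="urn:x-example:mark"
    exclude-result-prefixes="xs mark">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical stop-word list for the English phrases -->
  <xsl:variable name="stop-words" as="xs:string*"
                select="('a', 'an', 'the', 'of')"/>

  <!-- Split one phrase into words: hyphen to space, other listed
       punctuation removed, white-space normalized, then tokenized -->
  <xsl:function name="mark:words" as="xs:string*">
    <xsl:param name="arg" as="xs:string?"/>
    <xsl:sequence
        select="tokenize(normalize-space(translate($arg, '-,$%.#', ' ')), ' ')"/>
  </xsl:function>

  <xsl:template match="/">
    <words>
      <!-- Unique words from every @eng, minus stop words, sorted -->
      <xsl:for-each
          select="distinct-values(//@eng/mark:words(.))[not(. = $stop-words)]">
        <xsl:sort select="."/>
        <word><xsl:value-of select="."/></word>
      </xsl:for-each>
    </words>
  </xsl:template>

</xsl:stylesheet>
```

The same mark:words() call can then be reused wherever the links are generated, so both sides see identical tokens. Sorting the Czech words sensibly would likely need a collation or lang attribute on xsl:sort, which depends on the processor in use.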

I hope this helps.

. . . . . . . . . . Ken


--
Contact us for world-wide XML consulting and instructor-led training
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal
