Subject: RE: distinct-values() optimization, sorting by frequency
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 8 Feb 2008 14:48:28 -0000
In the alphabetical list, the expression
count($persNames[normalize-space(lower-case(.)) = $current-name])
could be optimized by:
(a) using keys
(b) using Saxon-SA, which will optimize it to use a key automatically
(c) using xsl:for-each-group rather than distinct-values(), though that will
require some restructuring of your code.
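As a rough sketch of option (a), the key below indexes each persName by its normalized value, so each count becomes a key lookup rather than a scan of $persNames. The key name persName-by-norm and the variable name n are made up for illustration, and the TEI namespace URI is assumed because the original stylesheet header is not shown. Since key() searches one tree at a time, the sketch also assumes $docs holds one node per document and sums the lookups across them.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <!-- Hypothetical key: index persName elements by normalized, lower-cased text -->
  <xsl:key name="persName-by-norm" match="tei:text//tei:persName"
      use="normalize-space(lower-case(.))"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <xsl:variable name="distinct-persNames"
        select="distinct-values($persNames/normalize-space(lower-case(.)))"/>
    <list type="unordered">
      <xsl:for-each select="$distinct-persNames">
        <xsl:sort select="."/>
        <xsl:variable name="current-name" select="."/>
        <!-- The context item here is a string, so the three-argument
             form of key() is needed; look up each document in turn -->
        <xsl:variable name="n"
            select="sum(for $d in $docs
                        return count(key('persName-by-norm', $current-name, $d)))"/>
        <item><xsl:value-of select="concat($current-name, ' -- ', $n)"/></item>
      </xsl:for-each>
    </list>
  </xsl:template>

</xsl:stylesheet>
```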
In the frequency-sorted list, I think for-each-group would definitely be
better:
<xsl:for-each-group select="$persNames" group-by="lower-case(.)">
<xsl:sort select="count(current-group())"/>
...
(Note also the use of a case-blind collation rather than lower-case(),
discussed in another thread today)
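Filled out a little, the frequency list might look something like the sketch below. It assumes the same $docs and $persNames variables as in the original post and an assumed TEI namespace URI, and it groups on normalize-space(lower-case(.)) to match the original normalization. Each group is built once, so count(current-group()) can be reused for both the sort key and the output without rescanning $persNames.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <list type="unordered">
      <!-- One group per distinct normalized name -->
      <xsl:for-each-group select="$persNames"
          group-by="normalize-space(lower-case(.))">
        <!-- Ascending by frequency, as in the original;
             add order="descending" to list the most frequent first -->
        <xsl:sort select="count(current-group())"/>
        <item>
          <xsl:value-of select="concat(count(current-group()),
                                       ' -- ', current-grouping-key())"/>
        </item>
      </xsl:for-each-group>
    </list>
  </xsl:template>

</xsl:stylesheet>
```

The alphabetical list can presumably be produced the same way by sorting on current-grouping-key() instead of the group size.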
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: James Cummings [mailto:cummings.james@xxxxxxxxx]
> Sent: 08 February 2008 14:28
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: distinct-values() optimization, sorting by frequency
>
> Hiya,
>
> I'm wondering the best way to optimize a distinct-values()
> based transformation. What I'm basically doing is:
> ======
> <xsl:variable name="docs"
> select="collection('../../working/xml/files.xml')"/>
>
> <xsl:template name="main" >
> <xsl:variable name="persNames"
> select="$docs//tei:text//tei:persName"/>
> <xsl:variable name="norm-persNames"
> select="$persNames/normalize-space(lower-case(.))"/>
> <xsl:variable name="distinct-persNames"
> select="distinct-values($norm-persNames)"/>
> <!-- I realize that I could be more specific on the
> $persNames variable, but doing so doesn't seem to affect
> speed much at all. -->
> <div type="main">
>
> <!-- Some overall counts -->
> <div><head>Overall Counts</head>
> <list type="unordered">
> <item>Number of <gi>persName</gi> elements total:
> <xsl:value-of select="count($persNames)"/></item>
> <item>Number of <gi>persName</gi> elements which have a
> @key attribute total: <xsl:value-of
> select="count($persNames[@key])"/></item>
> <item>Number of distinct-value <gi>persName</gi> elements total:
> <xsl:value-of select="count($distinct-persNames)"/></item>
> </list></div>
>
> <!-- An Alphabetical List -->
> <div><head>Alphabetical List</head>
> <list type="unordered">
> <xsl:for-each select="$distinct-persNames">
> <xsl:sort select="."/>
> <xsl:variable name="current-name" select="."/>
> <xsl:variable name="count-distinct-current-name"
> select="count($persNames[normalize-space(lower-case(.))
> =$current-name])"/>
> <item><xsl:value-of select="concat($current-name,
> ' -- ', $count-distinct-current-name)"/></item>
> </xsl:for-each>
> </list>
> </div>
>
> <!-- A Frequency Sorted List -->
> <div>
> <head>Frequency List</head>
> <list type="unordered">
> <xsl:for-each select="$distinct-persNames">
> <xsl:sort
> select="count($persNames[normalize-space(lower-case(.))
> = .])"/>
> <!-- I think it is this sort statement which slows things
> down, since I have to repeat it twice. -->
> <xsl:variable name="current-name" select="."/>
> <xsl:variable name="count-distinct-current-name"
> select="count($persNames[normalize-space(lower-case(.))
> = $current-name])"/>
> <item><xsl:value-of select="concat($count-distinct-current-name,
> ' -- ', $current-name)"/> </item>
> </xsl:for-each>
> </list>
> </div>
> </div>
> ======
>
> I think the real slow-down comes in the second xsl:for-each
> where I want to sort by frequency of distinct-value by doing:
> <xsl:sort
> select="count($persNames[normalize-space(lower-case(.)) = .])"/>
> I have to have it for the sort, and then I have to
> re-do it for the output inside the <item> element. I'm
> obviously not allowed a variable between the for-each and the
> sort... but I have a feeling I'm missing some clever
> optimization here.
>
> Although this is for a pre-generated transformation, it
> currently takes a *hugely* long time, and I'm thinking I must
> be able to optimize it somehow.
>
> Any suggestions appreciated,
>
> -James