I'm reminded of my first ever paid programming project, which was analyzing
some data for an archaeologist. He had a few dozen data points and wanted to
prove some conjecture by finding clusters. The first attempt didn't give the
results he wanted, so he suggested applying weights to the data. That didn't
work either, so he suggested different weights. At that point I suggested that
if he told me what results he wanted, I could calculate the weights that would
give the desired clustering. With this suggestion the penny dropped, namely
that given a limited number of data points you can prove anything you want.
And I learned a lesson that has guided my career ever since: the customer is
often wrong.

Michael Kay
Saxonica

> On 10 May 2026, at 19:38, Roger L Costello costello@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Folks,
>
> Table 1 shows three people at your workplace. Which two would you say are
> "closest" to each other? In other words, which two would naturally
> "cluster" together?
>
> Table 1. Three people at your workplace
>
> Person  Age  Kids  Income
> A       36   3     $100,000
> B       37   2     $80,000
> C       22   0     $101,000
>
> At first glance, persons A and B seem more similar: they are both working
> parents in their mid-30s. Person C looks different: much younger, no kids, but
> with a high income.
>
> However, if the data is not scaled, the income variable can dominate most
> distance formulas. That can make persons A and C appear "closer" than A
> and B, simply because A and C have similar incomes.
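
To put numbers on this, here is a minimal XPath 3.1 sketch that computes plain Euclidean distances on the raw values from Table 1. Euclidean distance is only one possible choice of distance formula, picked here for illustration. The variable names and the inline `$dist` function are hypothetical, and the `math` prefix is assumed to be bound to http://www.w3.org/2005/xpath-functions/math (in XSLT that needs an explicit namespace declaration).

```xpath
let $A := (36, 3, 100000),   (: age, kids, income :)
    $B := (37, 2, 80000),
    $C := (22, 0, 101000),
    (: Euclidean distance between two equal-length sequences of numbers :)
    $dist := function($p, $q) {
      math:sqrt(
        sum(for $i in 1 to count($p)
            return ($p[$i] - $q[$i]) * ($p[$i] - $q[$i])))
    }
return map {
  'A-B': $dist($A, $B),
  'A-C': $dist($A, $C)
}
```

A-B comes out at roughly 20000 and A-C at roughly 1000.1, so on the raw values A looks about twenty times closer to C than to B, almost entirely because of the income column.
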
>
> What does it mean to "scale" the data?
>
> Scaling means transforming variables so they are on comparable numeric
> ranges.
>
> You are not changing the meaning of the data. You are changing the units so
> one variable does not overpower the others.
>
> Common scaling methods include:
>
> - Min-max scaling
> - Standardization, also called z-scores
> - Normalization
>
> Example: min-max scaling
>
> A common approach is to map every variable into the range 0 to 1.
>
> The formula is:
>
> scaled value = (x - minimum value) / (maximum value - minimum value)
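
As a worked instance of the formula, the following XPath 3.1 sketch applies it to each column of Table 1 (with the values typed in directly rather than read from XML) and then compares Euclidean distances on the scaled values. The `$scale` and `$dist` functions are hypothetical helpers, Euclidean distance is an assumed choice, the `math` prefix is assumed to be bound to http://www.w3.org/2005/xpath-functions/math, and the sketch assumes each column has at least two distinct values (otherwise the denominator is zero).

```xpath
let $scale := function($xs) {
      (: scaled value = (x - min) / (max - min), applied to every x :)
      for $x in $xs
      return ($x - min($xs)) div (max($xs) - min($xs))
    },
    (: one sequence per column, in the order A, B, C :)
    $age    := $scale((36, 37, 22)),
    $kids   := $scale((3, 2, 0)),
    $income := $scale((100000, 80000, 101000)),
    (: Euclidean distance between persons at positions $i and $j :)
    $dist := function($i, $j) {
      let $d := ($age[$i] - $age[$j],
                 $kids[$i] - $kids[$j],
                 $income[$i] - $income[$j])
      return math:sqrt(sum($d ! (. * .)))
    }
return map {
  'age':    $age,
  'kids':   $kids,
  'income': $income,
  'A-B':    $dist(1, 2),
  'A-C':    $dist(1, 3)
}
```

The scaled ages are 14/15, 1 and 0, the scaled kids 1, 2/3 and 0, and the scaled incomes 20/21, 0 and 1. On these values A-B is about 1.01 and A-C about 1.37, so after scaling A is closer to B than to C, which matches the first-glance intuition.
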
>
> How would you represent Table 1 in XML, given the goal of scaling the data?
>
> Here is a conventional, row-oriented design:
>
> <Workplace>
>   <Staff>
>     <Person>A</Person>
>     <Age>36</Age>
>     <Kids>3</Kids>
>     <Income currency="USD">100000</Income>
>   </Staff>
>   <Staff>
>     <Person>B</Person>
>     <Age>37</Age>
>     <Kids>2</Kids>
>     <Income currency="USD">80000</Income>
>   </Staff>
>   <Staff>
>     <Person>C</Person>
>     <Age>22</Age>
>     <Kids>0</Kids>
>     <Income currency="USD">101000</Income>
>   </Staff>
> </Workplace>
>
> The XPath expressions for scaling are somewhat verbose because the values
> are distributed across multiple <Staff> elements instead of grouped together
> by feature.
>
> Still, XPath handles it quite nicely.
>
> Min-max scaling of ages:
>
>   for $staff in /Workplace/Staff
>   return
>     (xs:decimal($staff/Age) - min(/Workplace/Staff/Age ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Age ! xs:decimal(.)) -
>      min(/Workplace/Staff/Age ! xs:decimal(.)))
>
> Min-max scaling of kids:
>
>   for $staff in /Workplace/Staff
>   return
>     (xs:decimal($staff/Kids) - min(/Workplace/Staff/Kids ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Kids ! xs:decimal(.)) -
>      min(/Workplace/Staff/Kids ! xs:decimal(.)))
>
> Min-max scaling of incomes:
>
>   for $staff in /Workplace/Staff
>   return
>     (xs:decimal($staff/Income) - min(/Workplace/Staff/Income ! xs:decimal(.)))
>     div
>     (max(/Workplace/Staff/Income ! xs:decimal(.)) -
>      min(/Workplace/Staff/Income ! xs:decimal(.)))
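
One possible way to trim the repetition in the three expressions above, sketched in XPath 3.1: bind the formula to an inline function once and call it per column. The `$minmax` helper is hypothetical and not part of the original post. The document is inlined with `parse-xml` purely so the expression runs standalone; against a real source you would use the context document instead. It assumes every column has at least two distinct values.

```xpath
let $doc := parse-xml('
      <Workplace>
        <Staff><Person>A</Person><Age>36</Age><Kids>3</Kids>
               <Income currency="USD">100000</Income></Staff>
        <Staff><Person>B</Person><Age>37</Age><Kids>2</Kids>
               <Income currency="USD">80000</Income></Staff>
        <Staff><Person>C</Person><Age>22</Age><Kids>0</Kids>
               <Income currency="USD">101000</Income></Staff>
      </Workplace>'),
    (: min-max scale a whole column; min and max are computed once :)
    $minmax := function($values as xs:decimal*) as xs:decimal* {
      let $lo := min($values),
          $hi := max($values)
      return $values ! ((. - $lo) div ($hi - $lo))
    }
return map {
  'Age':    $minmax($doc/Workplace/Staff/Age    ! xs:decimal(.)),
  'Kids':   $minmax($doc/Workplace/Staff/Kids   ! xs:decimal(.)),
  'Income': $minmax($doc/Workplace/Staff/Income ! xs:decimal(.))
}
```

Each map entry holds the scaled values in document order (A, B, C), so the row-oriented design costs one path expression per column and nothing more.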