Hi Folks,
Table 1 shows three people at your workplace. Which two would you say are
"closest" to each other? In other words, which two would naturally "cluster"
together?

Table 1. Three people at your workplace

Person   Age   Kids   Income
A        36    3      $100,000
B        37    2      $80,000
C        22    0      $101,000

At first glance, persons A and B seem more similar: they are both working
parents in their mid-30s. Person C looks different: much younger, no kids, but
with a high income.
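However, if the data is not scaled, the income variable can dominate most
distance formulas: income differences run into the thousands of dollars, while
age and kid differences are at most 15. That can make persons A and C appear
"closer" than A and B, simply because A and C have similar incomes.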
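
To make that concrete, here is a minimal Python sketch that computes distances on the raw values from Table 1. It assumes plain Euclidean distance, which is only one of many possible distance formulas; the names people and euclidean are just for illustration.

```python
import math

# Rows of Table 1 as (age, kids, income in dollars), unscaled.
people = {
    "A": (36, 3, 100_000),
    "B": (37, 2, 80_000),
    "C": (22, 0, 101_000),
}

def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

dist_ab = euclidean(people["A"], people["B"])
dist_ac = euclidean(people["A"], people["C"])

print(f"A-B: {dist_ab:.1f}")  # A-B: 20000.0
print(f"A-C: {dist_ac:.1f}")  # A-C: 1000.1
```

The $20,000 income gap between A and B swamps everything else, so on raw values A looks about twenty times closer to C than to B.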
What does it mean to "scale" the data?
Scaling means transforming variables so they are on comparable numeric
ranges.
You are not changing the meaning of the data. You are changing the units so
one variable does not overpower the others.
Common scaling methods include:
1. Min-max scaling
2. Standardization, also called z-scores
3. Normalization
Example: min-max scaling
A common approach is to map every variable into the range 0 to 1.
The formula is:
scaled value = (x - minimum value) / (maximum value - minimum value)
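
As a quick check of the formula, here is a small Python sketch applied to the three columns of Table 1. The function name min_max_scale and the guard for a constant column are my own additions, not part of the formula itself.

```python
def min_max_scale(values):
    """Map a list of numbers onto [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: treat a constant column as an error rather than divide by zero.
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in values]

# Columns of Table 1, in the order A, B, C.
ages = [36, 37, 22]
kids = [3, 2, 0]
incomes = [100_000, 80_000, 101_000]

scaled_ages = min_max_scale(ages)
scaled_kids = min_max_scale(kids)
scaled_incomes = min_max_scale(incomes)

print([round(v, 3) for v in scaled_ages])     # [0.933, 1.0, 0.0]
print([round(v, 3) for v in scaled_kids])     # [1.0, 0.667, 0.0]
print([round(v, 3) for v in scaled_incomes])  # [0.952, 0.0, 1.0]
```

In every column the smallest value maps to 0 and the largest to 1, so no single variable can outweigh the others by virtue of its units.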
How would you represent Table 1 in XML, given the goal of scaling the data?
Here is a conventional, row-oriented design:
<Workplace>
    <Staff>
        <Person>A</Person>
        <Age>36</Age>
        <Kids>3</Kids>
        <Income currency="USD">100000</Income>
    </Staff>
    <Staff>
        <Person>B</Person>
        <Age>37</Age>
        <Kids>2</Kids>
        <Income currency="USD">80000</Income>
    </Staff>
    <Staff>
        <Person>C</Person>
        <Age>22</Age>
        <Kids>0</Kids>
        <Income currency="USD">101000</Income>
    </Staff>
</Workplace>
The XPath expressions for scaling are somewhat verbose because the values are
distributed across multiple <Staff> elements instead of grouped together by
feature.
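Still, XPath handles it quite nicely. The expressions below use the simple map
operator (!), so they need XPath 3.0 or later. Each one returns a sequence of
three scaled values, one per <Staff> element in document order.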
Min-max scaling of ages:
for $staff in /Workplace/Staff
return
    (xs:decimal($staff/Age) - min(/Workplace/Staff/Age ! xs:decimal(.)))
    div
    (max(/Workplace/Staff/Age ! xs:decimal(.)) -
     min(/Workplace/Staff/Age ! xs:decimal(.)))
Min-max scaling of kids:
for $staff in /Workplace/Staff
return
    (xs:decimal($staff/Kids) - min(/Workplace/Staff/Kids ! xs:decimal(.)))
    div
    (max(/Workplace/Staff/Kids ! xs:decimal(.)) -
     min(/Workplace/Staff/Kids ! xs:decimal(.)))
Min-max scaling of incomes:
for $staff in /Workplace/Staff
return
    (xs:decimal($staff/Income) - min(/Workplace/Staff/Income ! xs:decimal(.)))
    div
    (max(/Workplace/Staff/Income ! xs:decimal(.)) -
     min(/Workplace/Staff/Income ! xs:decimal(.)))
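
If you want to cross-check the XPath results outside an XPath processor, here is a rough Python sketch that parses the same row-oriented XML, min-max scales each feature, and then recomputes the distances. It is not a translation of the XPath: it uses floats rather than xs:decimal, it assumes Euclidean distance, and the helper names scaled_column and euclidean are made up for illustration.

```python
import math
import xml.etree.ElementTree as ET

XML = """
<Workplace>
    <Staff>
        <Person>A</Person><Age>36</Age><Kids>3</Kids>
        <Income currency="USD">100000</Income>
    </Staff>
    <Staff>
        <Person>B</Person><Age>37</Age><Kids>2</Kids>
        <Income currency="USD">80000</Income>
    </Staff>
    <Staff>
        <Person>C</Person><Age>22</Age><Kids>0</Kids>
        <Income currency="USD">101000</Income>
    </Staff>
</Workplace>
"""

root = ET.fromstring(XML)
staff = root.findall("Staff")
names = [s.findtext("Person") for s in staff]

def scaled_column(tag):
    """Collect one feature across all <Staff> rows and min-max scale it."""
    values = [float(s.findtext(tag)) for s in staff]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

columns = [scaled_column(tag) for tag in ("Age", "Kids", "Income")]
# Transpose the columns back into one scaled (age, kids, income) tuple per person.
scaled = dict(zip(names, zip(*columns)))

def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

dist_ab = euclidean(scaled["A"], scaled["B"])
dist_ac = euclidean(scaled["A"], scaled["C"])

print(f"A-B: {dist_ab:.2f}")  # A-B: 1.01
print(f"A-C: {dist_ac:.2f}")  # A-C: 1.37
```

After scaling, A is closer to B than to C, which matches the first-glance intuition about the two working parents.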