A big simplification in scaling comes from designing the XML so that the data
already represents a feature vector:
<Workplace>
    <Feature name="Age">
        <Value person="A">36</Value>
        <Value person="B">37</Value>
        <Value person="C">22</Value>
    </Feature>
    <Feature name="Kids">
        <Value person="A">3</Value>
        <Value person="B">2</Value>
        <Value person="C">0</Value>
    </Feature>
    <Feature name="Income" currency="USD">
        <Value person="A">100000</Value>
        <Value person="B">80000</Value>
        <Value person="C">101000</Value>
    </Feature>
</Workplace>
Now the scaling logic can be generic. It works for any feature because every
feature has the same structure:
$feature/Value
The generic transformation can be written in XPath 3.0 (the "!" simple map
operator requires 3.0 or later) as follows:

for $feature in /Workplace/Feature
return
    for $value in $feature/Value
    return
        (xs:decimal($value) - min($feature/Value ! xs:decimal(.)))
        div
        (max($feature/Value ! xs:decimal(.)) -
         min($feature/Value ! xs:decimal(.)))
That is a significant improvement.
You no longer need separate expressions for:
/Workplace/Staff/Age
/Workplace/Staff/Kids
/Workplace/Staff/Income
Instead, every variable has the same shape:
Feature/Value
So the algorithm becomes:
for each feature:
scale all of its values
That is much closer to how data scientists think about the problem:
for each column:
scale the values in that column
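For readers who want to try the column-wise algorithm outside an XPath
engine, here is a minimal Python sketch using the standard library's
ElementTree. The function name scale_features is mine, not part of any
library; the XML matches the document above:

```python
import xml.etree.ElementTree as ET

XML = """
<Workplace>
    <Feature name="Age">
        <Value person="A">36</Value>
        <Value person="B">37</Value>
        <Value person="C">22</Value>
    </Feature>
    <Feature name="Kids">
        <Value person="A">3</Value>
        <Value person="B">2</Value>
        <Value person="C">0</Value>
    </Feature>
    <Feature name="Income" currency="USD">
        <Value person="A">100000</Value>
        <Value person="B">80000</Value>
        <Value person="C">101000</Value>
    </Feature>
</Workplace>
"""

def scale_features(root):
    """Min-max scale every Feature/Value; the logic is identical for
    every feature because every feature has the same structure."""
    scaled = {}
    for feature in root.findall("Feature"):
        values = [float(v.text) for v in feature.findall("Value")]
        lo, hi = min(values), max(values)
        scaled[feature.get("name")] = [(v - lo) / (hi - lo) for v in values]
    return scaled

result = scale_features(ET.fromstring(XML))
# For Age, min = 22 and max = 37, so the scaled values are
# (36-22)/15 ~ 0.933, (37-22)/15 = 1.0, and (22-22)/15 = 0.0.
```

Note that, exactly as with the XPath version, adding a fourth feature to the
XML requires no change to the scaling code.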
So the best XML design for this kind of analysis is not merely
"column-oriented."
It is regular, metadata-driven, feature-oriented XML.
The key design principle is:
Make every feature look structurally identical.
That is what makes the scaling logic dramatically simpler and more reusable.
/Roger