Hi,
C falls into a different group on every variable:
A difference of about 15 years implies a different generation and viewpoint.
No children (at 22) implies C is not a parent, hence a very different
viewpoint, and also no dependent spouse.
Income should be calculated per capita: a couple with 3 children is 5
people sharing a 100K income, which is 20K per head for A, likewise 20K for
B (80K across 4 people), but 101K for C.
Hence, with no child or dependent, a different generation, and a roughly
fivefold per-capita income, C is in a different class.
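
As a rough check of that arithmetic, here is a minimal Python sketch. The
household sizes are assumptions drawn from the reasoning above (A and B each
live as a couple with their children, C lives alone); they are not in the
original table, and the function name is purely illustrative.

```python
# Household composition is an ASSUMPTION from the reasoning above:
# A and B are each part of a couple with children, C lives alone.
people = {
    "A": {"income": 100_000, "adults": 2, "kids": 3},
    "B": {"income": 80_000, "adults": 2, "kids": 2},
    "C": {"income": 101_000, "adults": 1, "kids": 0},
}


def per_capita_income(person):
    """Income divided by the number of people it supports."""
    return person["income"] / (person["adults"] + person["kids"])


per_capita = {name: per_capita_income(p) for name, p in people.items()}

for name, value in per_capita.items():
    print(f"{name}: {value:,.0f} per head")

# How many times larger C's per-capita income is than A's
ratio_c_to_a = per_capita["C"] / per_capita["A"]
```

Under these assumptions A and B land on the same per-capita figure, and C
sits at about five times that.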
Purely statistical approaches are useful, yet they remain approximate and
are often wrong in many ways.
Understanding context through the relevant qualified relationships between
the implied components can execute faster, while also being more explicit,
precise, meaningful, and reusable.
The only catch is that the implied qualified relationships need to be
modeled and efficiently navigated, which implies quite different code (more
graph- and dereferencing-oriented).
In the given use case, the data is typical database data, with no explicit
qualified relationships. In fact, these contextual relationships are
(partly) in the programmer's mind. In a simplified case like this one, that
may work OK, depending on statistical biases.
If the relationships were explicitly modeled, they could serve in many
other cases, and applications could adapt. But if they stay in the
designer's mind, one will need a new design and a new app every time.
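
To illustrate what "explicitly modeled" could look like, here is a small
Python sketch that stores relationships as (subject, relation, object)
triples and navigates them by lookup. The relation names (has_partner,
parent_of), the placeholder node names, and the helper functions are all
hypothetical, chosen only for this example; the partners are an assumption,
since the original table lists only the number of kids.

```python
from collections import defaultdict

# Hypothetical triples: (subject, relation, object).
# Partners are ASSUMED; the source data only gives the number of kids.
triples = [
    ("A", "has_partner", "A_partner"),
    ("A", "parent_of", "A_child_1"),
    ("A", "parent_of", "A_child_2"),
    ("A", "parent_of", "A_child_3"),
    ("B", "has_partner", "B_partner"),
    ("B", "parent_of", "B_child_1"),
    ("B", "parent_of", "B_child_2"),
]

# Index subject -> relation -> objects, so navigation is a direct lookup
index = defaultdict(lambda: defaultdict(list))
for subject, relation, obj in triples:
    index[subject][relation].append(obj)


def related(subject, relation):
    """Objects reachable from subject over one named relation."""
    return index.get(subject, {}).get(relation, [])


def is_parent(person):
    return bool(related(person, "parent_of"))


def household_size(person):
    """The person plus any partner and children recorded in the graph."""
    return 1 + len(related(person, "has_partner")) + len(related(person, "parent_of"))


for name in ("A", "B", "C"):
    print(name, is_parent(name), household_size(name))
```

The same triples can answer other questions (who is a parent, how many
people an income supports) without a new data design for each one.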
ac
On Sun, 10 May 2026 at 18:38, Michael Kay michaelkay90@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> I'm reminded of my first ever paid programming project, which was
> analyzing some data for an archaeologist. He had a few dozen data points and
> wanted to prove some conjecture by finding clusters. The first attempt
> didn't give the results he wanted, so he suggested applying weights to the
> data. That didn't work either, so he suggested different weights. At that
> point I suggested that if he told me what results he wanted, I could
> calculate the weights that would give the desired clustering. With this
> suggestion the penny dropped, namely that given a limited number of data
> points you can prove anything you want.
>
> And I learned a lesson that has guided my career ever since: the customer
> is often wrong.
>
> Michael Kay
> Saxonica
>
> On 10 May 2026, at 19:38, Roger L Costello costello@xxxxxxxxx <
> xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Folks,
>
> Table 1 shows three people at your workplace. Which two would you say are
> "closest" to each other? In other words, which two would naturally
> "cluster" together?
>
> Table 1. Three people at your workplace
>
> Person  Age  Kids  Income
> A       36   3     $100,000
> B       37   2     $80,000
> C       22   0     $101,000
>
> At first glance, persons A and B seem more similar: they are both working
> parents in their mid-30s. Person C looks different: much younger, no kids,
> but with a high income.
>
> However, if the data is not scaled, the income variable can dominate most
> distance formulas. That can make persons A and C appear "closer" than A
> and B, simply because A and C have similar incomes.
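
To make the effect concrete, here is a minimal Python sketch that computes
plain Euclidean distances on the raw Table 1 values. Euclidean distance is
an assumption on my part, as the message only says "most distance formulas";
the function name is illustrative.

```python
import math

# Raw (age, kids, income) values from Table 1, with no scaling applied
people = {
    "A": (36, 3, 100_000),
    "B": (37, 2, 80_000),
    "C": (22, 0, 101_000),
}


def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d_ab = euclidean(people["A"], people["B"])
d_ac = euclidean(people["A"], people["C"])

print(f"A-B: {d_ab:.1f}")
print(f"A-C: {d_ac:.1f}")
```

The A-B distance is almost exactly the 20,000 income gap, while A-C is close
to the 1,000 income gap: age and kids barely register.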
>
> What does it mean to "scale" the data?
>
> Scaling means transforming variables so they are on comparable numeric
> ranges.
>
> You are not changing the meaning of the data. You are changing the units
> so one variable does not overpower the others.
>
> Common scaling methods include:
>
> 1. Min-max scaling
> 2. Standardization, also called z-scores
> 3. Normalization
>
> *Example: min-max scaling*
>
> A common approach is to map every variable into the range 0 to 1.
>
> The formula is:
> scaled value = (x - minimum value) / (maximum value - minimum value)
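
As a cross-check of the formula outside XPath, here is a minimal Python
sketch that applies min-max scaling to each column of Table 1 and then
compares Euclidean distances. The function names are illustrative, and
Euclidean distance is an assumption, since the message does not name a
specific distance formula.

```python
import math

# (age, kids, income) per person, from Table 1
rows = {
    "A": (36, 3, 100_000),
    "B": (37, 2, 80_000),
    "C": (22, 0, 101_000),
}


def min_max(values):
    """Map values into [0, 1]; assumes max(values) != min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


names = list(rows)
columns = list(zip(*rows.values()))           # one tuple per feature
scaled_columns = [min_max(col) for col in columns]
scaled = dict(zip(names, zip(*scaled_columns)))  # back to one tuple per person

d_ab = euclidean(scaled["A"], scaled["B"])
d_ac = euclidean(scaled["A"], scaled["C"])

for name, values in scaled.items():
    print(name, [round(v, 3) for v in values])
print(f"A-B: {d_ab:.3f}  A-C: {d_ac:.3f}")
```

With all three features on the 0 to 1 range, A comes out closer to B than
to C, which matches the first-glance intuition.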
>
> How would you represent Table 1 in XML, given the goal of scaling the data?
>
> Here is a conventional, row-oriented design:
> <Workplace>
>   <Staff>
>     <Person>A</Person>
>     <Age>36</Age>
>     <Kids>3</Kids>
>     <Income currency="USD">100000</Income>
>   </Staff>
>   <Staff>
>     <Person>B</Person>
>     <Age>37</Age>
>     <Kids>2</Kids>
>     <Income currency="USD">80000</Income>
>   </Staff>
>   <Staff>
>     <Person>C</Person>
>     <Age>22</Age>
>     <Kids>0</Kids>
>     <Income currency="USD">101000</Income>
>   </Staff>
> </Workplace>
>
> The XPath expressions for scaling are somewhat verbose because the values
> are distributed across multiple <Staff> elements instead of grouped
> together by feature.
>
> Still, XPath handles it quite nicely.
>
> Min-max scaling of ages:
> for $staff in /Workplace/Staff
> return
>   (xs:decimal($staff/Age) - min(/Workplace/Staff/Age ! xs:decimal(.)))
>   div
>   (max(/Workplace/Staff/Age ! xs:decimal(.)) -
>    min(/Workplace/Staff/Age ! xs:decimal(.)))
>
> Min-max scaling of kids:
> for $staff in /Workplace/Staff
> return
>   (xs:decimal($staff/Kids) - min(/Workplace/Staff/Kids ! xs:decimal(.)))
>   div
>   (max(/Workplace/Staff/Kids ! xs:decimal(.)) -
>    min(/Workplace/Staff/Kids ! xs:decimal(.)))
>
> Min-max scaling of incomes:
> for $staff in /Workplace/Staff
> return
>   (xs:decimal($staff/Income) - min(/Workplace/Staff/Income ! xs:decimal(.)))
>   div
>   (max(/Workplace/Staff/Income ! xs:decimal(.)) -
>    min(/Workplace/Staff/Income ! xs:decimal(.)))
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/3500899> (by
> email)