Human-aligned Quantification of Numerical Data

Anton Kolonin

TL;DR

The findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test below 0.5; otherwise, the data can be treated as following a unimodal normal distribution.

Abstract

Quantifying numerical data involves two key challenges: first, determining whether the data can be naturally quantified, and second, identifying the numerical intervals or ranges of values that correspond to specific value classes, referred to as "quantums," which represent statistically meaningful states. If such quantification is feasible, continuous streams of numerical data can be transformed into sequences of "symbols" that reflect the states of the system described by the measured parameter. People often perform this task intuitively, relying on common sense or practical experience, while information theory and computer science offer computable metrics for this purpose. In this study, we assess the applicability of metrics based on information compression and the Silhouette coefficient for quantifying numerical data. We also investigate the extent to which these metrics correlate with one another and with what is commonly referred to as "human intuition." Our findings suggest that the ability to classify numeric data values into distinct categories is associated with a Silhouette coefficient above 0.65 and a Dip Test below 0.5; otherwise, the data can be treated as following a unimodal normal distribution. Furthermore, when quantification is possible, the Silhouette coefficient appears to align more closely with human intuition than the "normalized centroid distance" method derived from an information-compression perspective.
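To make the threshold rule in the abstract concrete, the following is a minimal sketch that computes the standard Silhouette coefficient for one-dimensional values with a given cluster labeling and then applies the two quoted thresholds. It is not the paper's implementation. The function names `silhouette_1d` and `is_quantifiable`, the toy values, and the illustrative Dip Test numbers are assumptions made here; the Dip Test itself is not implemented and is passed in as a precomputed number, since this section does not say how it is obtained.

```python
from statistics import mean


def silhouette_1d(values, labels):
    """Mean Silhouette coefficient for 1-D values under the given labels."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("the Silhouette coefficient needs at least 2 clusters")

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean([abs(v - u) for u in other])
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def is_quantifiable(silhouette, dip):
    """Thresholds quoted in the abstract; `dip` is a precomputed value."""
    return silhouette > 0.65 and dip < 0.5


# Hypothetical toy data: three well-separated pairs of values.
values = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
sc3 = silhouette_1d(values, [0, 0, 1, 1, 2, 2])  # three clusters
sc2 = silhouette_1d(values, [0, 0, 0, 1, 1, 1])  # two clusters
```

On this toy input the three-cluster labeling scores well above the 0.65 threshold, while the two-cluster labeling scores much lower, so a scan over candidate labelings would prefer three classes.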

Paper Structure

This paper contains 9 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Example of using the "Normalized Centroid Distance" metric. Left: one cluster, with 6 distances from the single centroid to the data points; the total length is large. Center: three clusters, with 6 distances from the 3 centroids to the 2 data points in each cluster plus 3 distances from the centroids to the common center; the total length is the smallest. Right: six clusters, with 6 distances from the common center to the centroids plus 6 zero distances from each centroid to the single point in its cluster; the total length is large. The optimal number of clusters, based on the minimum total distance, is 3.
  • Figure 2: The original input data are 100 points for $std = 0.1$ (top), corresponding to the obvious 3 distribution modes (middle), and plots for metrics calculated for different numbers of clusters with $SC$ and $NCDC$ agreement at $K = 3$ (bottom).
  • Figure 3: The original input data are 100 points for $std = 0.3$ (top), corresponding to the obvious 2 and possibly 3 distribution modes (middle), and plots for metrics calculated for different numbers of clusters with maximum $SC$ for $K = 2$ and minimum $NCDC$ for $K = 3$ (bottom).
  • Figure 4: The original input data are 100 points for $std = 1.0$ (top), corresponding to a single apparent mode or normal distribution (middle), and plots for metrics calculated for different numbers of clusters, without a pronounced $SC$ maximum and with the $NCDC$ minimum at $K = 1$ (bottom).
  • Figure 5: The most revealing results compare human estimates of the number of clusters or distribution modes with the numbers found by computed metrics such as $NCDC$ and $SC+$. The upper half covers the case of 100 data points, the lower half the case of 1000 data points. The distributions are labeled with the letters I, J, K, L, M, N, O, P. The colored graphs show the data points of the respective distributions, with the color legend on the right. The pie charts below the letters show the diversity of human estimates. On the left, three sections (the $NCDC$ metric, the $SC+$ metric, and the most typical human estimate) list the selected $K$ values in columns next to the letters denoting the distributions.
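
The total-length comparison described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the helper `total_centroid_distance` is a hypothetical name, the data are invented one-dimensional values arranged as three pairs, the "common center" is taken to be the mean of the centroids, and the normalization step implied by the metric's name is omitted because it is not given in this section.

```python
from statistics import mean


def total_centroid_distance(clusters):
    """Sum of point-to-centroid and centroid-to-common-center distances.

    `clusters` is a list of clusters, each a list of 1-D values.
    Assumption: the common center is the mean of the cluster centroids.
    """
    centroids = [mean(c) for c in clusters]
    center = mean(centroids)
    # Distances from every data point to the centroid of its own cluster.
    within = sum(abs(x - m) for c, m in zip(clusters, centroids) for x in c)
    # Distances from every centroid to the common center.
    between = sum(abs(m - center) for m in centroids)
    return within + between


# Hypothetical toy data mirroring the caption: 6 points, partitioned
# into one, three, and six clusters.
one = [[0.0, 1.0, 10.0, 11.0, 20.0, 21.0]]
three = [[0.0, 1.0], [10.0, 11.0], [20.0, 21.0]]
six = [[0.0], [1.0], [10.0], [11.0], [20.0], [21.0]]

totals = {len(p): total_centroid_distance(p) for p in (one, three, six)}
best_k = min(totals, key=totals.get)
```

For these values the totals are 41.0 for one cluster, 23.0 for three clusters, and 41.0 for six clusters, so the minimum falls at 3, in line with the caption.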