Boltzmann-Shannon Index: A Geometric-Aware Measure of Clustering Balance

Emanuele Bossi, C. Tyler Diggans, Abd AlRahman R. AlMomani

TL;DR

This paper tackles the problem that standard clustering validity metrics fail to capture how well a partition reflects both the frequency of cluster occupancy and the underlying geometry of the state space for continuous data. It introduces the Boltzmann-Shannon Index, defined as 1 minus the Jensen-Shannon Divergence between a frequency-based distribution of cluster labels and a geometry-based distribution derived from an SVD-based measure of each cluster's volume. The authors demonstrate that BSI rewards density-balanced partitions and penalizes misaligned geometry and frequency, with near-unity scores on Iris and meaningful sensitivity in synthetic and resource-allocation scenarios. The measure is differentiable and can be used as a regularizer in optimization, offering a practical tool for fair and balanced partitioning in continuous domains and complex dynamical systems.
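As a rough illustration of the definition above, the following Python sketch computes an index of the form 1 minus the Jensen-Shannon divergence between a frequency distribution and a geometry distribution over clusters. Several details are assumptions, not taken from the paper: the function names (`jensen_shannon_divergence`, `svd_volume`, `bsi`) are hypothetical, the divergence uses base-2 logarithms so that it lies in [0, 1], and the SVD-based volume is approximated by the product of singular values of each centered cluster, rescaled to per-axis standard deviations. The authors' exact volume measure may differ.

```python
import numpy as np


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence, bounded in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing (0 * log 0 := 0).
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def svd_volume(points):
    """ASSUMPTION: volume proxy for one cluster.

    Product of the singular values of the centered data, each divided
    by sqrt(n - 1), i.e. the product of spreads along principal axes.
    """
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return float(np.prod(singular_values / np.sqrt(len(points) - 1)))


def bsi(X, labels):
    """Sketch of the index: 1 - JSD(frequency || geometry)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    counts = np.array([(labels == c).sum() for c in clusters], dtype=float)
    volumes = np.array([svd_volume(X[labels == c]) for c in clusters])
    p = counts / counts.sum()    # frequency-based distribution
    q = volumes / volumes.sum()  # geometry-based distribution
    return 1.0 - jensen_shannon_divergence(p, q)


# Two clusters with equal counts: identical shapes vs. one inflated 10x.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2))
labels = np.array([0] * 200 + [1] * 200)
balanced = bsi(np.vstack([A, A + 10.0]), labels)
inflated = bsi(np.vstack([A, 10.0 * A + 100.0]), labels)
print(balanced, inflated)
```

With this volume proxy, each cluster needs more points than the data has dimensions, otherwise its estimated volume collapses to zero. In the example, the equal-density partition scores essentially 1, while inflating one cluster's spread at fixed counts lowers the score.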

Abstract

The Boltzmann-Shannon Index (BSI) for clustered continuous data is introduced as a normalized measure that captures the relationship between geometry-based and frequency-based probability distributions defined over the clusters. In essence, it quantifies the similarity across the densities of the clusters, which are defined by a given labeling. This labeling may originate from a geometric partitioning of the state space itself, but need not in general. We illustrate its performance on synthetic Gaussian mixtures, the Iris benchmark data set, and a high-imbalance resource-allocation scenario, showing that the BSI provides a coherent assessment in cases where traditional metrics give incomplete or misleading signals. Moreover, in the resource-allocation setting, where equal density may be associated with a "fair" distribution, we demonstrate that the BSI not only detects inequality with high sensitivity, but also offers a numerically smooth measure that can be easily embedded in optimization frameworks as a regularization term for modern policy-making. Finally, the BSI offers a new measure of the effectiveness of a given symbolic representation, i.e., a set of coarse-grained states, for continuous-valued data recorded from complex dynamical systems.

Paper Structure

This paper contains 8 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison between frequency-based and geometry-based quantifications of uncertainty in one-dimensional data. An example of (a) an underlying probability distribution $\rho(x)$ together with (b) the corresponding cumulative distribution function $\mathrm{CDF}(x)$. A traditional Riemann-sum (frequency-based) estimate of entropy uses (c) a distribution defined by the differences in the empirical CDF at fixed, equal-length bin boundaries; the probability $p_i$ of the $i$-th bin is given by the height of the corresponding blue step. In contrast, geometric partition entropy (GPE) constructs a measure-based representation as in (d) by partitioning the co-domain $[0,1]$ into equal masses and mapping the boundaries back through the inverse CDF; the probability $q_i$ is obtained from the geometric length $l_i$ taken as a proportion of the total state-space length. This figure is reproduced with permission from the authors of diggans2023boltzmann.
  • Figure 2: Two-cluster reversal example. Population frequency $\mathbf{p} = (\alpha, 1-\alpha)$ and geometric spread $\mathbf{q} = (1-\alpha, \alpha)$. The Boltzmann–Shannon Index (solid blue curve) attains its maximum value of 1 at $\alpha = 0.5$, corresponding to perfect frequency–geometry alignment, and decreases towards 0 as $\alpha\to 0^+$ (or $\alpha\to 1^-$), where the two distributions are completely inverted and nearly all points occupy the geometrically smaller (or larger) region.
  • Figure 3: Synthetic 2D Gaussian mixtures illustrating the behavior of the Boltzmann–Shannon Index. (a) balanced and well-separated clusters (high BSI); (b) moderately imbalanced sizes (intermediate BSI); and (c) strongly imbalanced and overlapping clusters (low BSI).
  • Figure 4: Boltzmann–Shannon Index as a function of the fairness parameter $\beta$ for population shares of 95.0%, 4.9%, and 0.1%. The index is maximal ($\approx 0.98$) under strictly proportional allocations ($\beta = +1$) and collapses to nearly zero when resources are inverted toward the smallest community ($\beta = -1$).
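
The two-cluster reversal described in the Figure 2 caption can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it assumes a base-2 Jensen-Shannon divergence (which matches the stated range of 1 down to 0), and the name `bsi_from_distributions` is hypothetical.

```python
import numpy as np


def bsi_from_distributions(p, q):
    """1 minus the base-2 Jensen-Shannon divergence between p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing (0 * log 0 := 0).
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


# Reversal setup from the caption: p = (alpha, 1 - alpha), q = (1 - alpha, alpha).
alphas = np.linspace(0.01, 0.99, 99)
curve = np.array(
    [bsi_from_distributions([a, 1.0 - a], [1.0 - a, a]) for a in alphas]
)

peak_alpha = alphas[np.argmax(curve)]
print(f"peak at alpha = {peak_alpha:.2f}, BSI = {curve.max():.4f}")
print(f"BSI at alpha = 0.01: {curve[0]:.4f}")
```

Because the mixture of $\mathbf{p}$ and $\mathbf{q}$ is always $(0.5, 0.5)$ in this setup, the curve is symmetric about $\alpha = 0.5$, peaks there at 1, and falls towards 0 at either extreme, in line with the caption.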