Boltzmann-Shannon Index: A Geometric-Aware Measure of Clustering Balance
Emanuele Bossi, C. Tyler Diggans, Abd AlRahman R. AlMomani
TL;DR
Standard clustering validity metrics fail to capture how well a partition of continuous data reflects both the frequency of cluster occupancy and the underlying geometry of the state space. This paper introduces the Boltzmann-Shannon Index (BSI), defined as 1 minus the Jensen-Shannon divergence between a frequency-based distribution over cluster labels and a geometry-based distribution derived from an SVD-based measure of each cluster's volume. The authors demonstrate that the BSI rewards density-balanced partitions and penalizes partitions whose geometry and frequency are misaligned, with near-unity scores on Iris and meaningful sensitivity in synthetic and resource-allocation scenarios. The measure is differentiable and can be used as a regularizer in optimization, offering a practical tool for fair and balanced partitioning in continuous domains and complex dynamical systems.
Abstract
The Boltzmann-Shannon Index (BSI) for clustered continuous data is introduced as a normalized measure that captures the relationship between geometry-based and frequency-based probability distributions defined over the clusters. In essence, it quantifies how similar the densities of the clusters are, where the clusters are defined by a given labeling. This labeling may originate from a geometric partitioning of the state space itself, but need not in general. We illustrate its performance on synthetic Gaussian mixtures, the Iris benchmark data set, and a high-imbalance resource-allocation scenario, showing that the BSI provides a coherent assessment in cases where traditional metrics give incomplete or misleading signals. Moreover, in the resource-allocation setting, where equal density may be associated with a "fair" distribution, we demonstrate that the BSI not only detects inequality with high sensitivity, but also offers a numerically smooth measure that can be easily embedded in optimization frameworks as a regularization term for modern policy-making. Finally, the BSI offers a new measure of the effectiveness of a given symbolic representation, i.e., a set of coarse-grained states, for continuous-valued data recorded from complex dynamical systems.
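The definition summarized above (1 minus the Jensen-Shannon divergence between a frequency-based and a geometry-based distribution over clusters) can be illustrated with a minimal Python sketch. Several choices here are assumptions and are not taken from the paper: the volume proxy (the product of the singular values of the centered cluster points, scaled by 1/sqrt(n - 1)), the use of base-2 logarithms so that the divergence lies in [0, 1], and all function names. The authors' exact SVD-based volume measure may differ.

```python
import numpy as np


def cluster_volume(points):
    """Hypothetical SVD-based volume proxy (an assumption, not the paper's formula).

    Uses the product of the singular values of the centered points,
    each scaled by 1/sqrt(n - 1) so that it behaves like a product of
    per-axis standard deviations.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    singular_values = singular_values / np.sqrt(max(len(points) - 1, 1))
    return float(np.prod(singular_values))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (bounded in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute zero
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def boltzmann_shannon_index(X, labels):
    """Sketch of BSI = 1 - JSD(frequency distribution, geometry distribution)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids, counts = np.unique(labels, return_counts=True)

    # Frequency-based distribution: share of points in each cluster.
    freq = counts / counts.sum()

    # Geometry-based distribution: share of total (proxy) volume per cluster.
    volumes = np.array([cluster_volume(X[labels == k]) for k in cluster_ids])
    if volumes.sum() <= 0:
        raise ValueError("all clusters have zero volume under this proxy")
    geom = volumes / volumes.sum()

    return 1.0 - jensen_shannon_divergence(freq, geom)


# Usage: two uniform square clusters, the second with four times the area.
rng = np.random.default_rng(0)
small = rng.uniform(0.0, 1.0, size=(400, 2))
large = rng.uniform(5.0, 7.0, size=(400, 2))

# Equal density: point counts proportional to area (100 vs. 400).
X_balanced = np.vstack([small[:100], large])
y_balanced = np.array([0] * 100 + [1] * 400)

# Mismatched: most points packed into the small cluster (400 vs. 100).
X_mismatched = np.vstack([small, large[:100]])
y_mismatched = np.array([0] * 400 + [1] * 100)

bsi_balanced = boltzmann_shannon_index(X_balanced, y_balanced)
bsi_mismatched = boltzmann_shannon_index(X_mismatched, y_mismatched)
print(f"balanced: {bsi_balanced:.3f}, mismatched: {bsi_mismatched:.3f}")
```

Under these assumptions, the equal-density labeling should score close to 1, while the labeling that packs most points into the small cluster should score noticeably lower. Clusters with fewer points than dimensions receive near-zero volume under this proxy, so a practical implementation would likely need additional handling for that case.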
