Which Similarity-Sensitive Entropy?

Phuc Nguyen, Josiah Couch, Rahul Bansal, Alexandra Morgan, Chris Tam, Miao Li, Rima Arnaout, Ramy Arnaout

TL;DR

The paper analyzes two similarity-sensitive entropy measures, the Leinster-Cobbold-Reeve framework (LCR) and the Vendi score (VS), which quantify information in datasets where inter-element similarities matter. Applying both measures to 53 ML datasets (imaging and tabular) and varying how similarity is scaled through a half-distance parameter k, the authors show that LCR and VS can provide complementary insights and can diverge by orders of magnitude, especially away from limiting regimes. They prove that VS is an upper bound on LCR for several Rényi-Hill orders (q = 2, 3, ∞) and conjecture that the bound holds for all q. They also highlight practical advantages of LCR: it does not require a positive semidefinite (PSD) similarity matrix, and it is computationally efficient. Their guidance is to use LCR as the default for capturing similarity-adjusted entropy, to reserve VS for quantum-like interpretations or for cases where elements are linear combinations of ur-elements, and to consider both metrics jointly for a richer, more robust view of dataset information.

Abstract

A canonical step in quantifying a system is to measure its entropy. Shannon entropy and other traditional entropy measures capture only the information encoded in the frequencies of a system's elements. Recently, Leinster, Cobbold, and Reeve (LCR) introduced a method that also captures the rich information encoded in the similarities and differences among elements, yielding similarity-sensitive entropy. More recently, the Vendi score (VS) was introduced as an alternative, raising the question of how LCR and VS compare, and which is preferable. Here we address these questions conceptually, analytically, and experimentally, using 53 machine-learning datasets. We show that LCR and VS can differ by orders of magnitude and can capture complementary information about a system, except in limiting cases. We demonstrate that both LCR and VS depend on how similarities are scaled and introduce the concept of "half distance" to parameterize this dependence. We prove that VS provides an upper bound on LCR for several values of the Rényi-Hill order parameter and conjecture that this bound holds for all values. We conclude that VS is preferable only when interpreting elements as linear combinations of a more fundamental set of "ur-elements" or when the system or dataset possesses a quantum-mechanical character. In the broader circumstance where one seeks simply to capture the rich information encoded by similarity, LCR is favored; nevertheless, for certain half-distances the two methods can complement each other.
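
To make the comparison concrete, the following is a minimal Python sketch of both measures in effective-number form, using their standard definitions: LCR as $\left(\sum_i p_i (Zp)_i^{q-1}\right)^{1/(1-q)}$ and VS as the Hill number of the eigenvalues of $\mathrm{diag}(\sqrt{p})\, Z\, \mathrm{diag}(\sqrt{p})$. The function names are hypothetical, and the similarity form $z_{ij} = 2^{-d_{ij}/k}$ (similarity halves every $k$ units of Euclidean distance) is an assumption made here for illustration; the paper's exact half-distance parameterization may differ.

```python
import numpy as np

def half_distance_similarity(X, k):
    """Assumed form: similarity halves for every k units of Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return np.power(2.0, -d / k)

def lcr(Z, p, q=1.0):
    """LCR similarity-sensitive diversity of order q, as an effective number."""
    Zp = Z @ p
    keep = p > 0  # elements with zero frequency contribute nothing
    if q == 1.0:
        return float(np.exp(-np.sum(p[keep] * np.log(Zp[keep]))))
    return float(np.sum(p[keep] * Zp[keep] ** (q - 1.0)) ** (1.0 / (1.0 - q)))

def vendi(Z, p, q=1.0):
    """Vendi score of order q: Hill number of the eigenvalues of sqrt(P) Z sqrt(P)."""
    s = np.sqrt(p)
    lam = np.linalg.eigvalsh(s[:, None] * Z * s[None, :])
    lam = lam[lam > 1e-12]  # drop numerically zero eigenvalues (0 log 0 = 0)
    if q == 1.0:
        return float(np.exp(-np.sum(lam * np.log(lam))))
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

# Illustrative data only: 50 random 2-D points with uniform frequencies.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
p = np.full(50, 1.0 / 50)
for k in (0.1, 1.0, 10.0):
    Z = half_distance_similarity(X, k)
    print(f"k={k}: LCR={lcr(Z, p):.2f}, VS={vendi(Z, p):.2f}")
```

In the two limiting cases both functions agree: with $Z$ equal to the identity (all elements completely dissimilar) each returns the number of elements, and with $Z$ all ones (all elements identical) each returns 1. Sweeping `k` between these extremes is one way to explore how far the two measures separate.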

Paper Structure

This paper contains 27 sections, 3 theorems, 28 equations, 8 figures, 1 table.

Key Result

Lemma 1

When $Z$ is full rank but otherwise as in the conjecture that Vendi bounds LCR, for all $q\in [-\infty, 0]$

Figures (8)

  • Figure 1: The concepts of element, frequency distribution, traditional entropy (at $q=1$), similarity function, similarity measure ($z_{ij}$, with examples), similarity matrix ($Z$), and S-entropy (here, LCR, also at $q=1$). Entropy values are expressed in effective-number form, i.e. in units of effective number of images present in the dataset. Different similarity measures can be chosen (see the section on $k$); in this example, the similarity measure is the normalized sum of the shared colors (0, 1, or 2) and shared features (outside color, inside color, and shape).
  • Figure 2: The first 10,000 images of the MNIST digits dataset colored (a) by digit (labeled using representative images), (b) by HDBSCAN cluster, and (c) by similarity to one of the images (a "9"). The effective number of images in this dataset is 12.5 by LCR and 95.9 by VS.
  • Figure 3: (a) Top 100 eigenimages for the MNIST digits dataset and (b) their eigenvalues (in the gray region), alongside the full eigenvalue spectrum. Note the log axes.
  • Figure 4: LCR and VS for (a) imaging and (b) tabular datasets, sorted by VS ($q=1$; default values of $k$ for each dataset type).
  • Figure 5: Correlations among LCR, VS, and HDBSCAN for (a) imaging and (b) tabular datasets. Each point represents a dataset. All entropies are effective-number forms at $q=1$. Red line = linear regression fit; gray line = 1:1 (off the scale to the left for (a), far right). Select datasets are labeled. (c) UMAPs for labeled imaging datasets. Gray points = unclustered elements.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Conjecture 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Conjecture 2
  • Conjecture 3
  • Theorem 2
  • proof