Measuring Dataset Diversity from a Geometric Perspective

Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan

TL;DR

The paper tackles the problem of measuring dataset diversity beyond entropy by linking diversity to the geometric and topological structure of data. It introduces PLDiv, a persistence-landscape-based diversity metric grounded in topological data analysis, with a closed-form expression that sums the squared lifetimes of 0D persistence features. The authors prove that PLDiv satisfies key diversity axioms and demonstrate its effectiveness on synthetic geometries, curvature data, text embeddings, and image embeddings, where it outperforms several baselines on geometry-sensitive tasks. They also address computational cost through sparse filtrations and MST-based strategies, which make PLDiv scalable to large datasets. Overall, PLDiv provides a principled, geometry-aware foundation for dataset construction, augmentation, and evaluation across modalities.
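The closed form and the MST-based strategy mentioned above can be illustrated with a minimal sketch. It relies on a standard fact: in a Vietoris-Rips filtration every 0D feature is born at 0 and dies at the length of a minimum-spanning-tree edge. The function names `h0_lifetimes` and `pldiv_sketch`, the Euclidean metric, and the `scale=0.25` factor (the area under a standard landscape tent) are assumptions made for illustration; the paper's own normalization and implementation may differ.

```python
import math


def h0_lifetimes(points):
    """Return finite H0 lifetimes of a point cloud (hypothetical helper).

    Assumes a Vietoris-Rips filtration with Euclidean distances, where each
    finite 0D feature is born at 0 and dies at an MST edge length.
    Uses Prim's algorithm in O(n^2) time.
    """
    n = len(points)
    if n < 2:
        return []
    # best[k] = shortest distance from point k to the tree grown so far
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    lifetimes = []
    for _ in range(n - 1):
        # Attach the point closest to the current tree.
        j = min((k for k in range(n) if not in_tree[k]), key=best.__getitem__)
        in_tree[j] = True
        lifetimes.append(best[j])
        # Update the remaining points' distances to the tree.
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best[k]:
                    best[k] = d
    return sorted(lifetimes)


def pldiv_sketch(points, scale=0.25):
    """Sum of squared H0 lifetimes times an assumed constant factor."""
    return scale * sum(life ** 2 for life in h0_lifetimes(points))


if __name__ == "__main__":
    spread = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    compact = [(0.1 * x, 0.1 * y) for x, y in spread]
    print(pldiv_sketch(spread), pldiv_sketch(compact))
```

Because only the MST edge lengths are needed, this route avoids building the full simplicial complex; the quadratic Prim loop here is chosen for readability rather than speed.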

Abstract

Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.

Paper Structure

This paper contains 33 sections, 15 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Illustration of PLDiv on four synthetic datasets. D1: uniformly scattered points; D2: less evenly spread distribution; D3: two separated clusters; D4: a single compact cluster with minimal diversity. We extract $H_0$ features via persistent homology, where lifetimes measure how long clusters persist before merging with their closest neighbors. Persistence landscapes capture these patterns, and PLDiv, defined as the sum of their integrals, reflects both scale and persistence, aligning with the datasets’ decreasing diversity.
  • Figure 2: The pipeline of PLDiv. Using a data cloud or its distance matrix, we build a filtration of simplicial complexes and track the birth and death of $H_0$ components by persistent homology. The resulting persistence diagram is then used to calculate persistence landscapes. Lastly, PLDiv is obtained by integrating these landscapes and serves as a metric for the dataset diversity.
  • Figure 3: Synthetic dataset comparison. Upper: eight dataset pairs (A vs. B), each with 200 points, generated to introduce or remove loops, bridges, or hierarchical clusters. Lower: diversity scores across metrics. PLDiv yields sharper and more coherent distinctions that reflect the true geometric differences between datasets, while Vendi Score, DCScore, and MagArea respond mainly to overall spread and fail to capture these structural changes in most cases.
  • Figure 4: PLDiv outperforms alternative diversity metrics in predicting ground-truth diversity across tasks and embedding models. Marker shapes distinguish the metrics' correlation scores, and error bars indicate standard deviations across 5 repeated cross-validation trials. Experiments with ABS-HDS exhibit larger error bars due to its smaller sample size.
  • Figure 5: PLDiv shows a near-perfect correlation with the number of classes in the dataset and remains consistent across different embedding models. MagArea performs next best, followed by DCScore, which exhibits some fluctuations in performance. Vendi Score, however, fails to capture the underlying patterns in the data.
  • ...and 7 more figures
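
The captions for Figures 1 and 2 describe PLDiv as the sum of the persistence-landscape integrals, while the TL;DR describes it as a sum of squared lifetimes. A short derivation connects the two. It assumes the standard landscape construction, in which each finite diagram point $(b_i, d_i)$ contributes a tent function $\Lambda_i$ and $\lambda_k(t)$ is the $k$-th largest tent value at $t$; the paper's own normalization may differ by a constant.

$$
\Lambda_i(t) = \max\{0,\ \min(t - b_i,\ d_i - t)\},
\qquad
\int_{\mathbb{R}} \Lambda_i(t)\,dt = \frac{(d_i - b_i)^2}{4}
$$

$$
\sum_{k \ge 1} \int_{\mathbb{R}} \lambda_k(t)\,dt
= \sum_i \int_{\mathbb{R}} \Lambda_i(t)\,dt
= \frac{1}{4} \sum_i (d_i - b_i)^2
$$

The first equality in the second line holds because, at every $t$, the landscape layers are a reordering of the tent values, so their sums agree. For $H_0$ features of a Vietoris-Rips filtration every birth is $b_i = 0$, which leaves $\tfrac{1}{4}\sum_i d_i^2$, a sum of squared lifetimes up to the assumed factor.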

Theorems & Definitions (1)

  • proof