Measuring Dataset Diversity from a Geometric Perspective
Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan
TL;DR
The paper tackles the problem of measuring dataset diversity beyond entropy by linking diversity to the geometric and topological structure of data. It introduces PLDiv, a persistence-landscape-based diversity metric grounded in topological data analysis, with a closed-form expression that sums the squared lifetimes of 0D persistence features. The authors prove that PLDiv satisfies key diversity axioms and demonstrate its effectiveness on synthetic geometries, curvature data, text embeddings, and image embeddings, where it outperforms several baselines on geometry-sensitive tasks. They also address computational cost through sparse filtrations and MST-based strategies, which make PLDiv scalable to large datasets. Overall, PLDiv provides a principled, geometry-aware foundation for dataset construction, augmentation, and evaluation across modalities.
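As a minimal sketch of the closed-form idea above, assume a Vietoris-Rips filtration on Euclidean points in which an edge appears at a scale equal to its length. Under that assumption every 0D feature is born at scale 0 and dies at the length of a minimum-spanning-tree (MST) edge, so the sum of squared 0D lifetimes can be read off the MST. The function names below are hypothetical, and any normalization constant in the paper's definition of PLDiv is omitted, so this illustrates the quantity rather than reproducing the authors' implementation.

```python
import math


def mst_edge_lengths(points):
    """Edge lengths of a Euclidean minimum spanning tree (Prim's algorithm, O(n^2)).

    Assumption: points is a sequence of equal-length coordinate tuples.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known distance from each point to the tree
    best[0] = 0.0          # start the tree at point 0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:
            lengths.append(best[u])  # length of the edge that attached u
        # Relax the distances of the remaining points through u.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return lengths


def sum_squared_h0_lifetimes(points):
    """Sum of squared 0D lifetimes (birth 0, death = MST edge length).

    Hypothetical helper: any scaling constant from the paper is not applied.
    """
    return sum(d * d for d in mst_edge_lengths(points))
```

For the three collinear points (0, 0), (1, 0), and (3, 0), the MST edges have lengths 1 and 2, so the sum of squared lifetimes is 1 + 4 = 5. A set of identical points gives 0, which matches the intuition that a dataset with no spread has no diversity.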
Abstract
Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.
