$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets
Phuc Nguyen, Rohit Arora, Elliot D. Hill, Jasper Braun, Alexandra Morgan, Liza M. Quintana, Gabrielle Mazzoni, Ghee Rye Lee, Rima Arnaout, Ramy Arnaout
TL;DR
The paper presents sentropy, a Python package that makes similarity-sensitive diversity analysis readily available by combining Hill-number-based measures with a tunable similarity matrix. It formalizes the frequency-sensitive diversity $D_q$ and its similarity-weighted counterpart $D_q^Z$, along with derived metrics such as representativeness and ordinariness, and introduces normalized variants to aid interpretation. Through examples from immunomics, metagenomics, computational pathology, and medical imaging, the authors demonstrate that S-entropy captures structure and overlap between elements without rigid binning, and that it scales to large datasets via on-the-fly computation and GPU acceleration. The package provides practical tooling for dataset curation, cross-dataset comparison, and assessment of diversity that complements traditional richness and entropy metrics, with implications for ML data quality and interpretability.
Abstract
Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed S-entropy (similarity-sensitive entropy), that incorporate both elements' frequencies and between-element similarities. Although these measures have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and they are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed $\textit{sentropy}$, a Python package that calculates S-entropy and is tailored to large datasets. $\textit{sentropy}$ can calculate any of the frequency-sensitive measures of Hill's D-number framework and their similarity-sensitive counterparts. $\textit{sentropy}$ also outputs measures that compare datasets. We first briefly review S-entropy, illustrating how it incorporates elements' frequencies and elements' pairwise similarities. We then describe $\textit{sentropy}$'s key features and usage. We end with examples from immunomics, metagenomics, computational pathology, and medical imaging that illustrate $\textit{sentropy}$'s applicability across a range of dataset types and fields.
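To make the distinction between frequency-sensitive and similarity-sensitive measures concrete, the following is a minimal NumPy sketch of the standard similarity-sensitive Hill number, $D_q^Z = \left(\sum_i p_i (Zp)_i^{\,q-1}\right)^{1/(1-q)}$, which reduces to the ordinary Hill number $D_q$ when $Z$ is the identity matrix. It is not the $\textit{sentropy}$ API: the function name `similarity_sensitive_diversity` and its parameters are hypothetical, chosen only for illustration, and the sketch assumes a dense similarity matrix small enough to hold in memory.

```python
import numpy as np

def similarity_sensitive_diversity(p, Z=None, q=1.0):
    """Illustrative D_q^Z (hypothetical helper, not the sentropy API).

    p : relative frequencies (or counts) of the elements
    Z : pairwise similarity matrix with ones on the diagonal;
        None means the identity, which gives the ordinary Hill number D_q
    q : viewpoint parameter; larger q emphasizes frequent elements
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize counts to frequencies
    if Z is None:
        Z = np.eye(len(p))               # no similarity between distinct elements
    Z = np.asarray(Z, dtype=float)

    keep = p > 0                         # absent elements do not contribute
    zp = (Z @ p)[keep]                   # (Zp)_i, often called "ordinariness"
    p = p[keep]

    if np.isinf(q):                      # limit q -> infinity
        return float(1.0 / zp.max())
    if np.isclose(q, 1.0):               # limit q -> 1 (exponential of entropy)
        return float(np.exp(-np.sum(p * np.log(zp))))
    return float(np.sum(p * zp ** (q - 1.0)) ** (1.0 / (1.0 - q)))


# Two equally frequent classes, treated as completely distinct.
p = [0.5, 0.5]
d_plain = similarity_sensitive_diversity(p, q=1.0)

# The same two classes, but 50% similar to each other.
Z = [[1.0, 0.5],
     [0.5, 1.0]]
d_similar = similarity_sensitive_diversity(p, Z, q=1.0)

print(d_plain, d_similar)
```

In this toy case the frequency-only measure counts two effective classes, while accounting for the 50% similarity lowers the effective number to about 1.33. This is the sense in which similarity-sensitive measures can reveal differences that size and class balance alone do not.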
