$\textit{sentropy}$: A Python Package for Revealing Hidden Differences in Complex Datasets

Phuc Nguyen, Rohit Arora, Elliot D. Hill, Jasper Braun, Alexandra Morgan, Liza M. Quintana, Gabrielle Mazzoni, Ghee Rye Lee, Rima Arnaout, Ramy Arnaout

TL;DR

The paper presents sentropy, a Python package that democratizes similarity-sensitive diversity analysis by unifying Hill-number–based measures with a tunable similarity matrix. It formalizes D_q and its similarity-weighted counterpart D_q^Z, along with derived metrics such as representativeness and ordinariness, and introduces normalized variants to aid interpretation. Through immunomics, metagenomics, and medical-imaging examples, the authors demonstrate that S-entropy captures structure and overlap without rigid binning and scales to large datasets via on-the-fly computation and GPU acceleration. The work provides practical tooling for dataset curation, cross-dataset comparison, and robust assessment of diversity that complements traditional richness and entropy metrics, with broad implications for ML data quality and interpretability.

Abstract

Machine-learning datasets are typically characterized by measuring their size and class balance. However, there exists a richer and potentially more useful set of measures, termed S-entropy (similarity-sensitive entropy), that incorporate elements' frequencies and between-element similarities. Although these have been available in the R and Julia programming languages for other applications, they have not been as readily available in Python, which is widely used for machine learning, and are not easily applied to machine-learning-sized datasets without special coding considerations. To address these issues, we developed $\textit{sentropy}$, a Python package that calculates S-entropy and is tailored to large datasets. $\textit{sentropy}$ can calculate any of the frequency-sensitive measures of Hill's D-number framework and their similarity-sensitive counterparts. $\textit{sentropy}$ also outputs measures that compare datasets. We first briefly review S-entropy, illustrating how it incorporates elements' frequencies and elements' pairwise similarities. We then describe $\textit{sentropy}$'s key features and usage. We end with several examples - immunomics, metagenomics, computational pathology, and medical imaging - illustrating $\textit{sentropy}$'s applicability across a range of dataset types and fields.
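
For reference, the two families of measures named in the abstract are commonly written as below. This follows the standard Hill and Leinster-Cobbold formulations; the symbols $p_i$ (relative frequency of element $i$) and $Z_{ij}$ (similarity between elements $i$ and $j$, with $Z_{ii}=1$) are notational assumptions here, not quotations from the paper.

$$
D_q = \left(\sum_i p_i^{\,q}\right)^{\frac{1}{1-q}},
\qquad
D_q^Z = \left(\sum_i p_i\,(Zp)_i^{\,q-1}\right)^{\frac{1}{1-q}},
\qquad
(Zp)_i = \sum_j Z_{ij}\,p_j
$$

$$
D_1^Z = \exp\!\left(-\sum_i p_i \ln (Zp)_i\right),
\qquad
D_\infty^Z = \frac{1}{\max_i\,(Zp)_i}
$$

The sums run over elements with $p_i>0$, and the second line gives the $q\to1$ and $q\to\infty$ limits. Setting $Z$ to the identity matrix gives $(Zp)_i=p_i$, so $D_q^Z$ reduces to $D_q$.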

Paper Structure (34 sections, 4 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: How frequency affects diversity (entropy). Each dataset contains six unique elements, so $D_0=6$ for each dataset (the "0" in $D_0$ makes $D_0$ frequency-insensitive). Thus, if we ignore frequency, the two datasets are equally diverse. However, Dataset 1a is mostly apples, while Dataset 1b has a nearly uniform distribution of different fruits. Thus, Dataset 1b is intuitively more diverse. Consistent with intuition, $D_1$ is only 1.9 for Dataset 1a but nearly 6 for Dataset 1b: for $q=1$, Dataset 1a effectively has only 1.9 unique elements. One can think of this as the apple counting as one unique element and the other fruits in Dataset 1a collectively counting as nearly (0.9) the equivalent of a second unique element. Up-weighting common elements can also be thought of as discounting rare ones; thus, different values of $q$ can be thought of as different discount factors, and therefore lead to different effective numbers. For example, for $q=\infty$, $D_\infty=1.17$ for Dataset 1a and 5.83 for Dataset 1b: with maximal discounting, Dataset 1a effectively has only fractionally more than a single element (effectively nearly all apples), whereas Dataset 1b still has close to all six (with a slight discount because there are only five bunches of grapes vs. six of each of the other fruits). The sentropy package can calculate any frequency-sensitive diversity measure.
  • Figure 2: How similarity affects diversity. Each dataset contains nine species (top), each at equal frequencies, so $D_0=D_1=D_2=\dots=D_\infty=9$ for each dataset. Thus, the two datasets are equally diverse according to similarity-insensitive measures. However, Dataset 2a is all birds, whereas Dataset 2b contains a wider variety of animals, making Dataset 2b intuitively more diverse. The similarity matrices $Z$ are accordingly quite different (bottom). Consistent with this intuition, $D_0^Z$ is only 1.06 for Dataset 2a, but 2.16 for Dataset 2b. Dataset 2a effectively has only 1.06 species, which can be interpreted as a single class or type (here, birds); the "extra" 0.06 reflects intragroup diversity among bird species. Meanwhile, Dataset 2b effectively has 2.16 species, corresponding roughly to the number of taxonomic phyla represented (Arthropoda and Chordata, which form two visible clusters in the similarity matrix). The sentropy package can calculate any similarity-sensitive diversity measure for any user-supplied definition of similarity.
  • Figure 3: Immunomics: Similarity-insensitive vs. -sensitive measures in influenza vaccination. Diversity of IGH CDR3 immunomes according to the similarity-insensitive measure $D_0$ (left) and its similarity-sensitive counterpart $D_0^Z$ (middle), and a comparison of the two (right) before vs. after influenza vaccination. Left and middle: light/dotted lines denote subjects in whom diversity falls. Right: each line/pair of symbols shows $D_0$ and $D_0^Z$ for the same subject. The dark line shows the one subject in whom vaccination was associated with fewer, more different sequences. Dashes in the margins indicate averages of $D_0$ and $D_0^Z$.
  • Figure 4: Metagenomics: Traditional vs. S-entropy alpha measures in synthetic samples. Samples 1 (a) and 2 (b) both have 1,000 sequences. For ease of illustration, distance along the x-axis reflects similarity (i.e. more similar sequences are nearer each other). In Sample 1, sequences form five highly distinct clusters. In Sample 2, sequences form 10 clusters that are more similar to one another. The distribution of sequences within each cluster is Gaussian, consistent with observations from the Human Microbiome Project [11]. Using binning to account for similarity and then measuring diversity using traditional/similarity-insensitive measures ($D_q$), diversity depends on the binning threshold $\tau$ and the frequency weighting $q$. (c) At $\tau=97\%$ and $q=0$, the two samples are equally diverse, with $D_0=11$ species; at higher $q$, Sample 2 is more diverse, with diversities falling to $D_\infty=8.6$ vs. $D_\infty=5.6$ species, respectively, at $q=\infty$. (d) At $\tau=99\%$, both samples are much more diverse, with $D_0=32$ vs. 23 species, respectively. (Note that at this $\tau$, Sample 2 is more diverse than Sample 1 for all $q$.) (e) In contrast, accounting for similarity using S-entropy measures ($D_q^Z$), which avoids the need for binning, the order flips: Sample 1 is now more diverse than Sample 2 (for all $q$), with $\sim5$ vs. $\sim3-4$ species, respectively, reflecting both the number and the grouping of the clusters. (Here similarity $s_{ij}$ between sequences $i$ and $j$ is calculated as $s_{ij}=e^{-k\Delta_{ij}}$, where $\Delta_{ij}$ is the Levenshtein distance between sequences $i$ and $j$ and $k=0.02$.)
  • Figure 5: Metagenomics: Beta diversities to elucidate population structure in synthetic samples. (a) Nine synthetic samples created as in Fig. 4a-b, three representing each of three enterotypes: black, Enterotype 1 (Samples 1-1 to 1-3); gray, Enterotype 2 (Samples 2-1 to 2-3); white, Enterotype 3 (Samples 3-1 to 3-3). (b)-(c) Clustering using similarity $s_{ij}$ between sequences $i$ and $j$ as defined in Fig. 4 without binning, according to (b) the $q=0$ representativeness ($\bar{\rho}_0$) of the $k^\text{th}$ sample for the pair of samples indicated by the heatmap cell and (c) the average representativeness of each member of the pair ($\bar{R}_0$).
  • ...and 3 more figures
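
The frequency-sensitive numbers quoted in the Figure 1 caption can be reproduced with a few lines of NumPy. This is a minimal sketch of the standard Hill-number formula, not the sentropy API. The counts for Dataset 1b follow the caption (five bunches of grapes vs. six of each other fruit); the counts for Dataset 1a are an assumption chosen to match the reported values, and `hill_number` is a hypothetical helper name.

```python
import numpy as np

def hill_number(counts, q):
    """Effective number of elements D_q computed from raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()          # relative frequencies of present elements
    if q == 1:                      # limit q -> 1: exponential of Shannon entropy
        return float(np.exp(-np.sum(p * np.log(p))))
    if np.isinf(q):                 # limit q -> infinity: 1 / largest frequency
        return float(1.0 / p.max())
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

# Assumed counts: 30 apples plus one each of five other fruits.
dataset_1a = [30, 1, 1, 1, 1, 1]
# From the caption: six of each fruit except five bunches of grapes.
dataset_1b = [6, 6, 6, 6, 6, 5]

for q in (0, 1, 2, np.inf):
    print(f"q={q}: 1a={hill_number(dataset_1a, q):.2f}, "
          f"1b={hill_number(dataset_1b, q):.2f}")
```

Under these assumed counts both datasets have $D_0=6$, while $D_1$ and $D_\infty$ for Dataset 1a round to 1.9 and 1.17, in line with the caption.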
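
The effect of the similarity matrix $Z$ described in the Figure 2 caption can be illustrated with a small sketch of the Leinster-Cobbold form of $D_q^Z$. The function name and the toy four-element matrix below are hypothetical; they are not the package's interface, nor the bird/animal data from the figure.

```python
import numpy as np

def similarity_sensitive_diversity(p, Z, q):
    """D_q^Z for frequencies p and similarity matrix Z (Z[i, i] == 1)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    Z = np.asarray(Z, dtype=float)
    keep = p > 0
    zp = (Z @ p)[keep]              # (Zp)_i: how "ordinary" each element is
    p = p[keep]
    if q == 1:
        return float(np.exp(-np.sum(p * np.log(zp))))
    if np.isinf(q):
        return float(1.0 / zp.max())
    return float(np.sum(p * zp ** (q - 1)) ** (1.0 / (1.0 - q)))

p = [0.25, 0.25, 0.25, 0.25]        # four equally frequent elements

# Toy matrix: two tight pairs (similarity 0.9) that barely resemble each other (0.1).
Z_clustered = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])

print(similarity_sensitive_diversity(p, np.eye(4), 0))        # no similarity
print(similarity_sensitive_diversity(p, Z_clustered, 0))      # two clusters
print(similarity_sensitive_diversity(p, np.ones((4, 4)), 0))  # all identical
```

With the identity matrix the result equals the ordinary count of four elements; with the clustered matrix it drops to roughly two, mirroring how Dataset 2b's effective number tracks its two visible clusters.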
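
The similarity definition given in the Figure 4 caption, $s_{ij}=e^{-k\Delta_{ij}}$ with Levenshtein distance $\Delta_{ij}$ and $k=0.02$, can be sketched as follows. The helper names and example sequences are hypothetical, and the dense pairwise loop is for illustration only; it is not how the package is described as handling large datasets.

```python
import numpy as np

def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def similarity_matrix(seqs, k=0.02):
    """Dense matrix with entries exp(-k * Levenshtein distance)."""
    n = len(seqs)
    Z = np.ones((n, n))             # diagonal stays 1: distance 0 to itself
    for i in range(n):
        for j in range(i + 1, n):
            Z[i, j] = Z[j, i] = np.exp(-k * levenshtein(seqs[i], seqs[j]))
    return Z

# Hypothetical toy sequences, not data from the paper.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
Z = similarity_matrix(seqs)
print(np.round(Z, 3))
```

A matrix built this way can be passed to any similarity-sensitive measure in place of a binning threshold $\tau$; larger $k$ makes similarity decay faster with edit distance.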