Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning

Amey P. Pasarkar; Adji Bousso Dieng

TL;DR

The paper proposes a family of similarity-based diversity metrics, the Vendi scores, parameterized by an order $q$, to overcome two limitations of Hill numbers: they require known item prevalence, and they ignore similarity between items. A central Similarity-Eigenvalue-Prevalence Theorem links the eigenvalues of a normalized similarity matrix to item prevalence, which yields Hill-number-like behavior without prevalence knowledge and allows generalization to any $q$. Empirically, the authors show that the choice of $q$ trades off sensitivity to rare items against sensitivity to duplicates: $q=\infty$ enhances molecular mixing in Vendi Sampling, and $\text{VS}_{\infty}$ correlates strongly with memorization scores for image generative models, while smaller $q$ emphasizes diversity across rare states. The work provides guidance for aligning $q$ with application goals, demonstrates that the scores are differentiable and can therefore be optimized, and highlights their potential to evaluate and control diversity in domains ranging from molecular simulations to generative AI. Kernel choice and scalability are noted as important practical considerations.
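For reference, the quantities named in this summary can be written out as follows. The notation is an assumption chosen for this sketch rather than taken from the text above: $p_1, \dots, p_S$ are the prevalences of $S$ categories, $K$ is an $n \times n$ positive semi-definite similarity matrix with unit diagonal, and $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $K/n$.

$$
D_q(p) = \Big(\sum_{i=1}^{S} p_i^{q}\Big)^{\frac{1}{1-q}},
\qquad
\text{VS}_q(K) = \exp\Big(\frac{1}{1-q}\,\log \sum_{i:\,\lambda_i > 0} \lambda_i^{q}\Big),
\qquad q \ge 0,\ q \neq 1,
$$

$$
\text{VS}_1(K) = \exp\Big(-\sum_{i:\,\lambda_i > 0} \lambda_i \log \lambda_i\Big),
\qquad
\text{VS}_{\infty}(K) = \frac{1}{\max_i \lambda_i}.
$$

Read this way, the Vendi score of order $q$ is the Hill number $D_q$ with the eigenvalues of $K/n$ in place of the prevalences; the cases $q=1$ and $q=\infty$ are the corresponding limits.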

Abstract

Measuring diversity accurately is important for many scientific fields, including machine learning (ML), ecology, and chemistry. The Vendi Score was introduced as a generic similarity-based diversity metric that extends the Hill number of order $q=1$ by leveraging ideas from quantum statistical mechanics. Contrary to many diversity metrics in ecology, the Vendi Score accounts for similarity and does not require knowledge of the prevalence of the categories in the collection to be evaluated for diversity. However, the Vendi Score treats each item in a given collection with a level of sensitivity proportional to the item's prevalence. This is undesirable in settings where there is a significant imbalance in item prevalence. In this paper, we extend the other Hill numbers using similarity to provide flexibility in allocating sensitivity to rare or common items. This leads to a family of diversity metrics, Vendi scores with different levels of sensitivity, that can be used in a variety of applications. We study the properties of the scores in a synthetic controlled setting where the ground truth diversity is known. We then test their utility in improving molecular simulations via Vendi Sampling. Finally, we use the Vendi scores to better understand the behavior of image generative models in terms of memorization, duplication, diversity, and sample quality.
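To make the role of the order $q$ concrete, here is a minimal NumPy sketch of a Vendi score of order $q$ computed from a similarity matrix. It is not the authors' implementation: the function name `vendi_score`, the eigenvalue cutoff `tol`, and the assumption that the matrix is symmetric positive semi-definite with unit diagonal are all choices made for this illustration.

```python
import numpy as np

def vendi_score(K, q=1.0, tol=1e-12):
    """Hypothetical helper: Vendi score of order q for a similarity matrix K.

    Assumes K is symmetric positive semi-definite with ones on the diagonal,
    so that the eigenvalues of K / n are non-negative and sum to one.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)   # eigenvalues of the normalized matrix
    lam = lam[lam > tol]              # drop zero (and numerically negative) eigenvalues
    if np.isinf(q):
        # Limit q -> infinity: only the largest eigenvalue matters.
        return float(1.0 / lam.max())
    if np.isclose(q, 1.0):
        # Limit q -> 1: exponential of the Shannon entropy of the eigenvalues.
        return float(np.exp(-np.sum(lam * np.log(lam))))
    # General order: Hill number of order q applied to the eigenvalues.
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

# Four mutually dissimilar items: every order reports four effective items.
distinct = np.eye(4)
# Four identical items: every order reports one effective item.
identical = np.ones((4, 4))
scores_distinct = [vendi_score(distinct, q) for q in (0.1, 0.5, 1.0, 2.0, np.inf)]
scores_identical = [vendi_score(identical, q) for q in (0.1, 0.5, 1.0, 2.0, np.inf)]
```

In these two extreme collections the order makes no difference; the orders only disagree when eigenvalues are unequal, which is the imbalanced setting the abstract is concerned with.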

Paper Structure (14 sections, 2 theorems, 9 equations, 7 figures)

INTRODUCTION
RELATED WORK
HILL NUMBERS AND ECOLOGICAL DIVERSITY
COUSINS OF THE VENDI SCORE: EXTENDING HILL NUMBERS USING SIMILARITY
EMPIRICAL STUDY
Application To Vendi Sampling
Application To Generative Models
DISCUSSION
CONCLUSION
Appendix
Proof of Theorem 4.1
Vendi Sampling: Alanine Dipeptide Experimental Details
Vendi Sampling: Double Well System
Image Generative Model Analysis

Key Result

Theorem 4.1

[The Similarity-Eigenvalue-Prevalence Theorem] Let $({\mathbf{x}}_1, \dots, {\mathbf{x}}_N)$ denote a collection of elements, where each ${\mathbf{x}}_i = ({\mathbf{x}}_{i1}, \dots, {\mathbf{x}}_{iM_i})$ contains a unique element repeated $M_i$ times, i.e. ${\mathbf{x}}_{ij} = {\mathbf{x}}_{ik}$ for
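The link between eigenvalues and prevalence can be checked numerically in the simplest duplicate-only setting. The sketch below assumes a similarity of 1 between copies of the same element and 0 between distinct elements; that zero-similarity condition, the helper name `duplicate_kernel`, and the counts `[5, 3, 2]` are assumptions made for this illustration. Under them, the non-zero eigenvalues of $K/n$ should equal the prevalences $M_i / \sum_j M_j$.

```python
import numpy as np

def duplicate_kernel(counts):
    """Hypothetical helper: similarity matrix for a collection of duplicates.

    Element i appears counts[i] times; similarity is 1 between copies of the
    same element and 0 between different elements (an assumed kernel).
    """
    labels = np.repeat(np.arange(len(counts)), counts)
    return (labels[:, None] == labels[None, :]).astype(float)

counts = np.array([5, 3, 2])            # M_1, M_2, M_3
n = counts.sum()
K = duplicate_kernel(counts)

# Eigenvalues of the normalized similarity matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(K / n))[::-1]
top = eigvals[: len(counts)]            # candidates for the non-zero eigenvalues
rest = eigvals[len(counts):]            # should be numerically zero

# Prevalence of each unique element, largest first.
prevalence = np.sort(counts / n)[::-1]
print(top, prevalence)
```

Because the eigenvalues recover the prevalences here, any Hill number evaluated on them matches the Hill number of the underlying category distribution, without the prevalences ever being supplied directly.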

Figures (7)

Figure 1: Sensitivity of Different Vendi Scores Under Different Scenarios. (A) Varying the number of classes under perfect balance. Each Vendi score measures the number of classes exactly; they are effective numbers. (B) Varying the number of classes under imbalance. Smaller orders $q$ more accurately describe the correct number of modes. (C) Combining two similarity functions for shape and color. All choices of order $q$ except $q=\infty$ give increases in diversity with the similarity composition. (D) Varying the correlation of shape and color features. As the correlation between shape and color decreases from left to right, all $q$ except $q=\infty$ yield larger Vendi scores. (E) Decreasing the similarity between class members. $q=\infty$ gives a Vendi Score that is more resistant to intra-class variance. For smaller $q$s, the Vendi scores increase with larger amounts of variance, although the Vendi score with $q=0.1$ decreases slightly between example $S_2$ and $S_3$.
Figure 2: Sensitivity of different Vendi scores to a missing Alanine Dipeptide conformation. Left: Ramachandran plot from an unbiased simulation of Alanine Dipeptide plotted against the two dihedral angles $\phi,\psi$. Center: Ramachandran plot after removing the left-handed conformation. Right: The percent difference for different Vendi scores between samples from the original simulation and samples missing the left-handed state. Vendi scores are calculated using $20{,}000$ molecules from each set of samples using an invariant RBF kernel with $\gamma=1$.
Figure 3: Behavior of the Vendi scores for sampling Alanine Dipeptide. Left: Convergence of Vendi Sampling under different scores over $25$ ns of simulation to the free energy difference estimated from long unbiased simulations (dashed gray line). Right: Number of transitions for each score in and out of the left-handed state over the first $50$ ps of simulation. Shaded regions represent uncertainty over $10$ trials.
Figure 4: Vendi scores correlate strongly with human evaluation and memorization scores on CIFAR-10 and Imagenet256. Left: Human classification error rate vs. Vendi Score$_\infty$ for models trained on CIFAR-10 (Top) and Imagenet256 (Bottom). Right: $C_T$-modified vs. Vendi Score$_\infty$ for models trained on CIFAR-10 (Top) and Imagenet256 (Bottom).
Figure 5: Pearson correlations between metrics averaged across four training datasets. $C_T$-modified is computed only on CIFAR-10 and Imagenet256. Vendi scores of large order $q$ correlate strongly with various metrics for evaluating generative models.
...and 2 more figures

Theorems & Definitions (4)

Theorem 4.1
proof
Theorem 8.1: The Similarity-Eigenvalue-Prevalence Theorem
proof
