The Vendi Score: A Diversity Evaluation Metric for Machine Learning

Dan Friedman; Adji Bousso Dieng

TL;DR

The paper introduces the Vendi Score (VS), a flexible, reference-free diversity metric defined as the exponential of the Shannon entropy of the eigenvalues of a sample's similarity kernel matrix. Because it is tied to the kernel's eigenstructure, VS can be read as the effective number of dissimilar elements in a sample and accounts for correlations between features; it can be computed efficiently when embeddings are available. The authors evaluate VS on synthetic data, molecules, images, and text, where it agrees with established diversity metrics in many cases and is more sensitive than IntDiv to duplicated samples and dropped modes. They also give theoretical properties, connections to spectral methods and determinantal point processes (DPPs), and implementation details. The result is a domain-agnostic, similarity-driven diversity measure that can inform data curation, data augmentation, and model evaluation.

Abstract

Diversity is an important criterion for many areas of machine learning (ML), including generative modeling and dataset curation. However, existing metrics for measuring diversity are often domain-specific and limited in flexibility. In this paper, we address the diversity evaluation problem by proposing the Vendi Score, which connects and extends ideas from ecology and quantum statistical mechanics to ML. The Vendi Score is defined as the exponential of the Shannon entropy of the eigenvalues of a similarity matrix. This matrix is induced by a user-defined similarity function applied to the sample to be evaluated for diversity. In taking a similarity function as input, the Vendi Score enables its user to specify any desired form of diversity. Importantly, unlike many existing metrics in ML, the Vendi Score does not require a reference dataset or distribution over samples or labels; it is therefore general and applicable to any generative model, decoding algorithm, and dataset from any domain where similarity can be defined. We showcase the Vendi Score on molecular generative modeling where we found it addresses shortcomings of the current diversity metric of choice in that domain. We also applied the Vendi Score to generative models of images and decoding algorithms of text where we found it confirms known results about diversity in those domains. Furthermore, we used the Vendi Score to measure mode collapse, a known shortcoming of generative adversarial networks (GANs). In particular, the Vendi Score revealed that even GANs that capture all the modes of a labeled dataset can be less diverse than the original dataset. Finally, the interpretability of the Vendi Score allowed us to diagnose several benchmark ML datasets for diversity, opening the door for diversity-informed data augmentation.
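The definition in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `vendi_score`, the numerical tolerance, and the toy matrices are assumptions made here. It also assumes the similarity function gives every item a self-similarity of 1, so that dividing the n-by-n similarity matrix by n makes its eigenvalues sum to one and the Shannon entropy is well defined.

```python
import numpy as np


def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K / n.

    K is assumed to be a symmetric positive semidefinite similarity
    matrix with ones on the diagonal (an assumption of this sketch).
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically zero eigenvalues, using the convention 0 * log(0) = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Toy check: four identical items versus four mutually dissimilar items.
identical = np.ones((4, 4))  # every pair has similarity 1
distinct = np.eye(4)         # every pair of different items has similarity 0

print(vendi_score(identical))  # close to 1: one effective element
print(vendi_score(distinct))   # close to 4: four effective elements
```

The two toy cases bracket the score: a sample of n items scores about 1 when all items are identical and about n when all items are mutually dissimilar, which matches the "effective number of unique elements" reading.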

Paper Structure (40 sections, 5 theorems, 9 equations, 13 figures, 5 tables)

Introduction
Are We Measuring Diversity Correctly in ML?
Measuring Diversity with the Vendi Score
Defining the Vendi Score
Understanding the Vendi Score
Calculating the Vendi Score
Sample complexity
Connections to Other Areas in ML
Experiments
Synthetic experiments
Evaluating molecular generative models for diversity
Assessing mode collapse in GANs
Evaluating image generative models for diversity
Evaluating decoding algorithms for text for diversity
Diagnosing datasets for diversity
...and 25 more sections

Key Result

Lemma 3.1

Consider the same setting as Definition 3.1. Then

Figures (13)

Figure 1: (a) The Vendi Score, VS in the figure, can be interpreted as the effective number of unique elements in a sample. It increases linearly with the number of modes in the dataset. IntDiv, the expected dissimilarity, becomes less sensitive as the number of modes increases, converging to 1. (b) Combining distinct similarity functions can increase the Vendi Score, as should be expected of a diversity metric, while leaving IntDiv unchanged. (c) IntDiv does not take into account correlations between features, but the Vendi Score does. The Vendi Score is highest when the items in the sample differ in many attributes, and the attributes are not correlated with each other.
Figure 2: VS increases proportionally with diversity in three sets of synthetic datasets. In each row, we sample datasets from univariate mixture-of-normal distributions, varying either the number of components, the mixture proportions, or the per-component variance. The datasets are depicted on the left, as histograms, and the diversity scores are plotted on the right.
Figure 3: The kernel matrices for $250$ molecules sampled from the HMM, AAE, and the original dataset, sorted lexicographically by SMILES string representation. The samples have similar IntDiv scores, but the HMM samples score much lower on VS. The figure shows that the HMM generates a number of exact duplicates. VS is able to capture the HMM's lack of diversity while IntDiv cannot.
Figure 4: The categories in CIFAR-100 with the lowest and highest VS, defining similarity as the cosine similarity between either Inception embeddings or pixel vectors. We show $100$ examples from each category, in decreasing order of average similarity, with the image at the top left having the highest average similarity scores according to the corresponding kernel.
Figure 5: Detecting mode dropping in image and text datasets. We evaluate VS and IntDiv on datasets containing $500$ examples drawn uniformly from between one and ten classes: digits in MNIST and sentence genres in MultiNLI. Compared to IntDiv, VS increases more consistently with the number of classes.
...and 8 more figures
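
The behaviour described in the caption of Figure 1(a) can be reproduced on a toy kernel. The sketch below is an illustration under stated assumptions, not the paper's experiment: it builds an idealized similarity matrix in which items from the same mode have similarity 1 and items from different modes have similarity 0, and it uses a simplified IntDiv, taken here as one minus the mean of all kernel entries (the "expected dissimilarity" of the caption). The helper names `vendi_score`, `int_div`, and `mode_kernel` are hypothetical.

```python
import numpy as np


def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K / n."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # convention: 0 * log(0) = 0
    return float(np.exp(-np.sum(lam * np.log(lam))))


def int_div(K):
    """Simplified IntDiv (assumption): one minus the mean pairwise similarity."""
    return float(1.0 - K.mean())


def mode_kernel(num_modes, per_mode=5):
    """Idealized kernel: similarity 1 within a mode, 0 across modes."""
    return np.kron(np.eye(num_modes), np.ones((per_mode, per_mode)))


# Compare the two metrics as the number of equally sized modes grows.
for m in (1, 2, 5, 10, 50):
    K = mode_kernel(m)
    print(m, round(vendi_score(K), 3), round(int_div(K), 3))
```

In this idealized setting the Vendi Score equals the number of modes m, while the simplified IntDiv equals 1 - 1/m, so it approaches 1 and barely changes once m is large.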

Theorems & Definitions (10)

Definition 3.1: Vendi Score
Lemma 3.1
Theorem 3.1: Properties of the Vendi Score
Definition 7.1: Probability-Weighted Vendi Score
Lemma 7.1
Lemma
proof
Lemma
proof
proof
