The Vendiscope: An Algorithmic Microscope For Data Collections

Amey P. Pasarkar, Adji Bousso Dieng

TL;DR

The paper presents the Vendiscope, a computational tool that treats data collections as objects to be analyzed rather than merely modeled, targeting redundancy, bias, and memorization in large-scale datasets. It maximizes the probability-weighted Vendi Score (pVS), defined as $\text{pVS}_k(\mathbf{x}_1,\dots,\mathbf{x}_N,\mathbf{p}) = \exp\left(-\sum_{i=1}^{N} \eta_{ip} \log \eta_{ip}\right)$ (with a Rényi generalization of order $q$, $\text{pVS}_k = \exp\left(\frac{1}{1-q}\log \sum_{i=1}^{N} \eta_{ip}^q\right)$), by learning the data-point distribution $\mathbf{p}$ via gradient-based optimization so that rare, diversity-enhancing samples receive more weight. Through scalable techniques (projective gradients, embeddings and cosine similarities to form the similarity matrix $\mathbf{K}$, and parallel computation to achieve $O(d^2 n)$ complexity), the Vendiscope scales to hundreds of millions of items. In three domains, it reveals major redundancy and model weaknesses. In the protein universe (~$250$ million sequences), over $200$ million sequences are near-duplicates, and AlphaFold struggles on proteins whose Gene Ontology (GO) functions contribute most to diversity. In the Materials Project (~$1.7\times 10^5$ crystals), more than $85\%$ are near-duplicates, and ML models falter on materials that increase diversity. In CIFAR-10, memorization patterns emerge across $13$ generative models, with the best-performing models often memorizing the training samples that contribute least to diversity. Overall, the Vendiscope provides a unified, scalable framework for data auditing, de-duplication, and understanding how diversity shapes model behavior, which supports more robust data curation.
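
As a rough illustration of the pVS formula above, the following Python sketch evaluates it on a three-item toy collection. It assumes, following the original Vendi Score formulation, that the $\eta$ values are the eigenvalues of $\mathrm{diag}(\sqrt{\mathbf{p}})\,\mathbf{K}\,\mathrm{diag}(\sqrt{\mathbf{p}})$; the TL;DR itself does not define them. The function names `cosine_kernel` and `pvs` and the toy embeddings are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine_kernel(embeddings):
    """Cosine-similarity matrix K from row-wise embeddings of shape (n, d)."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def pvs(K, p, q=1.0):
    """Probability-weighted Vendi Score of order q.

    Assumption: the eta values are the eigenvalues of
    diag(sqrt(p)) K diag(sqrt(p)), with K_ii = 1 and p on the simplex.
    """
    s = np.sqrt(np.asarray(p, dtype=float))
    K_p = s[:, None] * K * s[None, :]
    eta = np.linalg.eigvalsh(K_p)
    eta = eta[eta > 1e-12]  # drop (numerically) zero eigenvalues: 0 log 0 = 0
    if np.isclose(q, 1.0):
        return float(np.exp(-np.sum(eta * np.log(eta))))
    return float(np.exp(np.log(np.sum(eta ** q)) / (1.0 - q)))

# Toy collection: two exact duplicates plus one orthogonal item.
K = cosine_kernel([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

uniform = pvs(K, [1 / 3, 1 / 3, 1 / 3])    # about 1.89
reweighted = pvs(K, [0.25, 0.25, 0.5])     # about 2.0
print(uniform, reweighted)
```

With uniform weights the duplicated pair is over-counted and the score stays below 2. Shifting weight toward the distinct item raises the score to 2, which matches the two effectively distinct items in the toy collection.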

Abstract

The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the $250$ million protein sequences in the protein universe, discovering that over $200$ million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than $85\%$ of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from $13$ different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.
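
The idea of assigning weights by contribution to diversity can be sketched with a small projected-gradient loop. This is a minimal sketch under simplifying assumptions, not the authors' algorithm. It uses the Rényi order $q=2$, for which (assuming the eigenvalue definition from the original Vendi Score formulation) the probability-weighted score reduces to $1/\sum_{i,j} p_i p_j K_{ij}^2$ and needs no eigendecomposition. It also uses a standard sort-based Euclidean projection onto the probability simplex. The names `project_to_simplex` and `learn_weights`, the step size, and the iteration count are illustrative choices.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def learn_weights(K, steps=200, lr=0.1):
    """Maximize the order-2 score by minimizing p^T (K*K) p on the simplex."""
    A = K * K                      # elementwise square of the similarity matrix
    n = K.shape[0]
    p = np.full(n, 1.0 / n)        # start from uniform weights
    for _ in range(steps):
        grad = 2.0 * A @ p         # gradient of p^T A p
        p = project_to_simplex(p - lr * grad)
    return p

# Toy similarity matrix: items 0 and 1 are duplicates, item 2 is distinct.
K = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

p = learn_weights(K)
score = 1.0 / (p @ (K * K) @ p)
print(np.round(p, 3), round(float(score), 3))
```

On this toy input the duplicated pair ends near a weight of 0.25 each and the distinct item near 0.5, so the item that adds the most diversity receives the largest weight.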

Paper Structure

This paper contains 3 sections, 5 equations, 11 figures, 1 table, 2 algorithms.

Figures (11)

  • Figure 1: The rarest (top-scoring) proteins and the (low-scoring) proteins that contribute least to diversity, as identified by the Vendiscope, along with their corresponding AlphaFold predicted structures. Rare proteins are mostly uncharacterized or are biologically unrealistic. For example, one of the identified rare proteins misses the characteristic banana shape in the F-BAR domain. In contrast, bottom-scoring proteins are involved in fundamental pathways such as NAD(+) synthesis and transsulfuration.
  • Figure 2: AlphaFold confidence is worse on rare protein sequences. Left: Violin plot of average pLDDT for the top (rarest) and bottom (most common) $50{,}000$ sequences. Right: Violin plot of AlphaFold confidences for proteins with certain GO functions. We select $10$ GO functions that are primarily present among low-scoring proteins ('Common GO') and $10$ GO functions that are enriched among high-scoring proteins ('Rare GO'). GO functions are shown in Figure 3.
  • Figure 3: Various selected Gene Ontology (GO) functions that are enriched among highly-ranked and low-ranked proteins. All displayed functions concentrated in rare proteins have roles in protein binding (GO:0005515), whereas all displayed functions in low-ranked proteins fall under amino acid metabolic processes (GO:0006520).
  • Figure 4: The Vendiscope identifies duplicates more accurately than MMseqs2, as demonstrated by the large protein clusters with consistent annotations it finds. Top: PCA scatter plot of all proteins originating from the ahcY gene, with duplicate clusters from the Vendiscope (left) and MMseqs2 (right). The $10$ clusters with the most proteins from the ahcY gene are shown for both methods. Bottom: PCA scatter plot of all proteins annotated with the GO:0003862 function (3-isopropylmalate dehydrogenase activity), with duplicate clusters from the Vendiscope (left) and MMseqs2 (right). The $10$ clusters with the most proteins containing this function are shown for both methods.
  • Figure 5: Property prediction worsens on rare materials across models. Top: Analysis of models trained to predict formation energy, showing larger predictive errors on the $500$ rarest materials according to the Vendiscope compared to the bottom $500$ materials. The rare materials correspond to those with fewer sites in their unit cells. Middle: Analysis of models trained to predict band gap on non-conducting materials. Predictive errors are higher for rare materials compared to common materials. Y-axis is logarithmic. The rare materials correspond to those with smaller band gaps. Bottom: Analysis of models trained on band gap prediction on conductors. Prediction errors are significantly higher for rare materials. For all models except ALIGNN, rare materials correspond to those with higher energies above the hull. All distributions are statistically distinct as measured by Mann-Whitney U Tests with p-values less than $0.01$.
  • ...and 6 more figures