Quality-Weighted Vendi Scores And Their Application To Diverse Experimental Design

Quan Nguyen, Adji Bousso Dieng

TL;DR

The paper tackles the tendency of experimental-design algorithms to over-exploit and under-explore by introducing the quality-weighted Vendi Score (qVS), a diversity metric that also accounts for item quality. The qVS extends the Vendi Score and exposes a Rényi-like order parameter $q$ that smoothly controls the quality–diversity trade-off. By embedding the qVS in active search and Bayesian optimization, the authors demonstrate improved discovery of diverse, high-quality data across drug discovery, materials discovery, and reinforcement learning tasks, reporting a 70%-170% increase in the number of effective discoveries compared to baselines. The approach offers a principled, scalable alternative to hard diversity constraints in expensive discovery pipelines.
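To make the quality-diversity trade-off concrete, here is a minimal Python sketch of how a quality-weighted Vendi Score could be computed. It assumes the qVS is the average quality of a batch multiplied by its Vendi Score of order $q$, where the Vendi Score is the exponential of the Rényi entropy of the eigenvalues of the normalized similarity matrix; this summary does not state the formula, so treat that definition as an assumption to check against the paper. The function names, the RBF similarity kernel, the Gaussian scoring function, and the length scale are all illustrative choices, not the authors' code.

```python
import numpy as np

def vendi_score(K, q=1.0):
    """Vendi Score of order q for an n x n PSD similarity matrix K with unit diagonal."""
    n = K.shape[0]
    # Eigenvalues of K / n are non-negative and sum to 1 when diag(K) == 1.
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # drop numerical zeros (0 * log 0 is taken as 0)
    if np.isinf(q):
        return float(1.0 / lam.max())
    if np.isclose(q, 1.0):
        # Shannon case: exponential of the entropy of the eigenvalues.
        return float(np.exp(-np.sum(lam * np.log(lam))))
    # General order q: exponential of the Renyi entropy of order q.
    return float(np.sum(lam ** q) ** (1.0 / (1.0 - q)))

def quality_weighted_vendi_score(K, scores, q=1.0):
    """Assumed qVS: mean quality of the batch times its Vendi Score of order q."""
    return float(np.mean(scores)) * vendi_score(K, q)

def rbf_kernel(X, lengthscale=0.2):
    """Illustrative similarity kernel (an assumption, not taken from the paper)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def gaussian_quality(X, center=0.5, width=0.3):
    """Illustrative scoring function: a Gaussian bump centered in the unit square."""
    return np.exp(-np.sum((X - center) ** 2, axis=-1) / (2.0 * width ** 2))

# Compare a batch collapsed onto the optimum with a batch spread around it.
collapsed = np.full((4, 2), 0.5)
spread = np.array([[0.3, 0.3], [0.3, 0.7], [0.7, 0.3], [0.7, 0.7]])
for name, X in [("collapsed", collapsed), ("spread", spread)]:
    qvs = quality_weighted_vendi_score(rbf_kernel(X), gaussian_quality(X), q=1.0)
    print(name, round(qvs, 3))
```

Under this definition, a batch of identical points has a Vendi Score of 1 regardless of its quality, so the collapsed batch can only score as high as its average quality, while a spread batch trades some quality for a larger effective number of distinct items.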

Abstract

Experimental design techniques such as active search and Bayesian optimization are widely used in the natural sciences for data collection and discovery. However, existing techniques tend to favor exploitation over exploration of the search space, which causes them to get stuck in local optima. This "collapse" problem prevents experimental design algorithms from yielding diverse high-quality data. In this paper, we extend the Vendi scores -- a family of interpretable similarity-based diversity metrics -- to account for quality. We then leverage these quality-weighted Vendi scores to tackle experimental design problems across various applications, including drug discovery, materials discovery, and reinforcement learning. We found that quality-weighted Vendi scores allow us to construct policies for experimental design that flexibly balance quality and diversity, and ultimately assemble rich and diverse sets of high-performing data points. Our algorithms led to a 70%-170% increase in the number of effective discoveries compared to baselines.

Paper Structure (14 sections, 12 equations, 6 figures, 2 tables, 3 algorithms)

Figures (6)

  • Figure 1: Batches of 10 data points maximizing various VS and qVS functions, obtained with multi-start gradient-based optimization. The scoring function is a Gaussian function centered at the middle point, as illustrated by the heat maps. The quality-weighted Vendi Score balances between the quality of the selected data points and their diversity; this balance is smoothly controlled by the order $q$.
  • Figure 2: Data points collected by diversity-blind search and our diversity-aware policy in the materials discovery problem with bulk metal glasses. Our method appropriately balances between exploring the search space and focusing on regions containing positive data, and discovers more effective positives as a result.
  • Figure 3: Average optimization performance and standard errors across 10 repeated experiments. Our method (shown in red) performs competitively across the different settings.
  • Figure 4: Trajectories identified by various search methods in the rover path finding problem. Our method finds a diverse set of paths, whose diversity can be controlled using the order $q$ of the qVS.
  • Figure 5: Average storage capacity values (higher is better) and standard errors of the best MOFs found by our algorithm under different orders $q$. Here, $q = 0$ corresponds to regular, diversity-blind Bayesian optimization.
  • ...and 1 more figure