Variational Search Distributions

Daniel M. Steinberg; Rafael Oliveira; Cheng Soon Ong; Edwin V. Bonilla

TL;DR

Steinberg et al. address batch active generation for rare, desirable designs in combinatorial spaces by learning a conditional generative model. They introduce Variational Search Distributions (VSD), a variational-inference framework that approximates the level-set posterior $p(\mathbf{x}|y>\tau)$ with a parameterized $q(\mathbf{x}|\boldsymbol{\phi})$ and uses ELBO optimization with either a GP-PI surrogate or an NN-based class-probability estimator. The paper provides asymptotic convergence guarantees for the learned distribution under GP and NTK-based neural models and demonstrates superior performance over baselines on handwritten-digit conditioning and real sequence-design tasks (DHFR, TrpB, TFBIND8, Ehrlich, GFP, AAV). The results show VSD scales to high-dimensional, discrete design spaces and effectively guides batch experiments, indicating practical impact for protein/DNA/RNA engineering and other combinatorial design problems.
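As a rough illustration of the GP-PI surrogate mentioned above: if a surrogate returns a Gaussian predictive distribution for the fitness at a design, the probability of exceeding the threshold $\tau$ is a normal CDF evaluation. The sketch below assumes the caller supplies a predictive mean and standard deviation. The function names are hypothetical, and whether observation noise is folded into the standard deviation is left as an assumption rather than taken from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, tau):
    """P(y > tau) under a Gaussian predictive N(mean, std**2).

    `mean` and `std` are assumed to come from a surrogate model
    (e.g. a GP posterior at a design x); they are inputs here,
    not computed by this sketch.
    """
    if std <= 0.0:
        # Degenerate predictive: all mass sits at the mean.
        return 1.0 if mean > tau else 0.0
    return normal_cdf((mean - tau) / std)


# A design whose predicted mean sits exactly at the threshold
# has an even chance of exceeding it.
pi_at_threshold = probability_of_improvement(mean=1.0, std=0.5, tau=1.0)
```

Values of this kind can serve as the class probability $p(y>\tau|\mathbf{x})$ inside a variational objective; a neural class-probability estimator would play the same role without going through a Gaussian predictive.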

Abstract

We develop VSD, a method for conditioning a generative model of discrete, combinatorial designs on a rare desired class by efficiently evaluating a black-box (e.g. experiment, simulation) in a batch sequential manner. We call this task active generation; we formalize active generation's requirements and desiderata, and formulate a solution via variational inference. VSD uses off-the-shelf gradient based optimization routines, can learn powerful generative models for desirable designs, and can take advantage of scalable predictive models. We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various protein and DNA/RNA engineering tasks.
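To make the variational-inference formulation concrete, here is a minimal sketch of a standard evidence lower bound for the target $p(\mathbf{x}|y>\tau) \propto p(y>\tau|\mathbf{x})\,p(\mathbf{x})$, namely $\mathbb{E}_{q}[\log p(y>\tau|\mathbf{x})] - \mathrm{KL}[q(\mathbf{x})\,\|\,p(\mathbf{x})]$. It is evaluated exactly by enumeration on a toy sequence space. The alphabet, the uniform prior, and the `class_prob` function are illustrative assumptions, and the paper's exact objective and gradient estimators may differ.

```python
import itertools
import math

ALPHABET = "ACGT"   # hypothetical toy alphabet
LENGTH = 3          # 4**3 = 64 designs, small enough to enumerate
DESIGNS = ["".join(s) for s in itertools.product(ALPHABET, repeat=LENGTH)]


def prior(x):
    """Uniform prior p(x) over the toy design space (an assumption)."""
    return 1.0 / len(DESIGNS)


def class_prob(x):
    """Hypothetical stand-in for p(y > tau | x): favors designs with more 'A's."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x.count("A") - 3.0)))


def elbo(q):
    """Exact ELBO: sum_x q(x) * [log p(y>tau|x) + log p(x) - log q(x)]."""
    total = 0.0
    for x, qx in q.items():
        if qx > 0.0:
            total += qx * (math.log(class_prob(x)) + math.log(prior(x)) - math.log(qx))
    return total


# Evidence p(y > tau) and the exact level-set posterior, for reference.
evidence = sum(class_prob(x) * prior(x) for x in DESIGNS)
log_evidence = math.log(evidence)
posterior = {x: class_prob(x) * prior(x) / evidence for x in DESIGNS}

# Using the prior itself as q gives a strictly looser bound.
prior_q = {x: prior(x) for x in DESIGNS}
```

In this toy setting the bound is tight exactly when `q` equals the level-set posterior, which is why maximizing it over a parameterized family pulls $q(\mathbf{x}|\boldsymbol{\phi})$ toward $p(\mathbf{x}|y>\tau)$. Realistic design spaces cannot be enumerated, so the expectation would be estimated from samples of `q` instead.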

Paper Structure (39 sections, 19 theorems, 113 equations, 13 figures, 4 tables, 1 algorithm)

Introduction
Method
The Problem of Active Generation
Variational Search Distributions
Class Probability Estimation
Theoretical Analysis
Related Work
Experiments
Conditional Generation of Handwritten Digits
Fitness Landscapes
Black-Box Optimization
Conclusion
Acronyms
Depiction of Active Generation
Experimental Details
...and 24 more sections

Key Result

theorem 2.1

Under mild assumptions (a:gp to a:prior), the variational distribution of VSD equipped with GP-PI converges in probability to the level-set distribution at the following rate: [rate expression not preserved in this extract]

Figures (13)

Figure 1: Fitness landscape properties and models. (a) A noise-less fitness landscape, ${f\!\centerdot}(\mathbf{x})$, and the maximum fitness design, $\mathcal{S}_\textrm{BBO} = \{\mathbf{x}^*\}$, as the white '$\times$'. (b) The super level-set, $\mathcal{S}_\textrm{SLS}$, of all fit designs as the white hatched area. (c) Prior belief $p({\mathbf{x}})$. (d) The density/mass function of the super level-set, $p({\mathbf{x}}|{y > \tau})$, as blue contours. Our goal is to sequentially estimate a generative model for the distribution of the super level-set (d). We assume a noisy relationship between ${f\!\centerdot}$ and $y$, so the super level-set will not have a hard boundary, and $p({\mathbf{x}}|{y > \tau})$ will be defined over all $\mathcal{X}$.
Figure 2: (a) and (b) are samples from the LSTM and transformer priors, respectively. (c) and (d) show samples from the LSTM and transformer variational distributions, respectively. We also report the samples' mean scores according to the probabilities.
Figure 3: Fitness landscape results. Precision, recall and performance (higher is better) for the combinatorially (near) complete datasets, DHFR, TrpB and TFBIND8. The random method is implemented by drawing $B$ samples uniformly.
Figure 4: AAV & GFP results. Simple regret (lower is better) on GFP and AAV with independent and auto-regressive variational distributions. The [...] and AdaLead results are replicated between the plots, since they are unaffected by the choice of variational distribution.
Figure 5: Ehrlich function (poli implementation) results. [...] and [...] with different variational distributions, including mean field (MF) and transformer (TFM), compared against genetic algorithm (GA) and LaMBO-2 baselines.
...and 8 more figures

Theorems & Definitions (32)

theorem 2.1
corollary 2.1
corollary 2.2
lemma E.1: chowdhury2017
lemma E.2: Second Borel-Cantelli lemma
lemma E.3
proof
lemma E.4
proof
lemma E.5
...and 22 more
