Autofocused oracles for model-based design

Clara Fannjiang, Jennifer Listgarten

TL;DR

The paper addresses how to design objects with desired properties when the data-driven proxies (oracles) that stand in for experiments may be unreliable outside their training distribution. It reframes model-based design (MBD) as a non-gradient, distributional optimization and introduces autofocusing, a strategy that retrains the oracle in lockstep with the evolving search model to minimize the oracle gap, that is, the difference between the ground-truth and oracle-based objectives. An alternating ascent-descent algorithm is proposed to reach a Nash equilibrium between the search model and the oracle, with theoretical insights on variance control and covariate shift. Empirically, autofocusing improves design performance in toy examples and on a large-scale superconductors dataset, often yielding higher ground-truth objectives and better agreement between oracle predictions and true outcomes. The work provides practical guidance for integrating adaptive oracle retraining into diverse model-based optimization (MBO) frameworks, with potential extensions to uncertainty estimation and model selection.

Abstract

Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a therapeutic target, or a superconducting material with a higher critical temperature than previously observed. To that end, costly experimental measurements are being replaced with calls to high-capacity regression models trained on labeled data, which can be leveraged in an in silico search for design candidates. However, the design goal necessitates moving into regions of the design space beyond where such models were trained. Therefore, one can ask: should the regression model be altered as the design algorithm explores the design space, in the absence of new data? Herein, we answer this question in the affirmative. In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) develop a principled strategy for retraining the regression model as the design algorithm proceeds---what we refer to as autofocusing, and (iii) demonstrate the promise of autofocusing empirically.
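To make the idea of retraining the regression model as the search moves more concrete, here is a minimal sketch of one retraining step. It rests on assumptions that this summary does not spell out: the figure captions mention importance weights on training points, and the sketch takes those to be density ratios $p_\theta(\mathbf{x})/p_0(\mathbf{x})$ between the search and training distributions; both distributions are taken to be one-dimensional Gaussians; and the oracle is a weighted least-squares polynomial. The function names and the normalization of the weights are illustrative choices, not the authors' implementation.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def importance_weights(x_train, train_mean, train_std, search_mean, search_std):
    """Assumed weight form: ratio of search density to training density.

    Weights are rescaled to have mean 1; this rescaling is a convenience
    chosen for this sketch, not something stated in the source.
    """
    ratio = gaussian_pdf(x_train, search_mean, search_std) / gaussian_pdf(
        x_train, train_mean, train_std
    )
    return ratio * len(x_train) / ratio.sum()


def refit_oracle(x_train, y_train, weights, degree=2):
    """Weighted least-squares polynomial fit (hypothetical stand-in oracle).

    np.polyfit applies w to the unsquared residuals, so passing sqrt(weights)
    minimises sum_i weights[i] * (y_i - poly(x_i)) ** 2.
    """
    return np.polyfit(x_train, y_train, deg=degree, w=np.sqrt(weights))


# Illustrative data: training inputs centred at 0, search model centred at 1.5.
rng = np.random.default_rng(0)
x_train = rng.normal(loc=0.0, scale=1.0, size=200)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=200)

weights = importance_weights(x_train, 0.0, 1.0, 1.5, 0.5)
coefficients = refit_oracle(x_train, y_train, weights)
```

Training points that lie where the search model places its mass receive larger weights, so the refit oracle spends its capacity on the region currently being explored. In an alternating scheme, this refit would be repeated each time the search model is updated.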

Paper Structure

This paper contains 38 sections, 4 theorems, 21 equations, 6 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

For any search model, $p_\theta(\mathbf{x})$, if the oracle parameters, $\beta$, satisfy where $D_\text{KL}(p \,\|\, q)$ is the Kullback-Leibler (KL) divergence between distributions $p$ and $q$, then the following bound holds:

Figures (6)

  • Figure 1: Illustrative example. Panels (a-d) show detailed snapshots of the MBO algorithm, CbAS (Brookes et al., 2019), with and without autofocusing (AF) in each panel. The vertical axis represents both $y$ values (for the oracle and ground truth) and probability density values (of the training distribution, $p_0(\mathbf{x})$, and search distributions, $p_{\theta^{(t)}}(\mathbf{x})$). Shaded envelopes correspond to $\pm1$ standard deviation of the oracles, $\sigma_{\beta^{(t)}}$, with the oracle expectations, $\mu_{\beta^{(t)}}(\mathbf{x})$, shown as a solid line. Specifically, (a) at initialization, the oracle and search model are the same for AF and non-AF. Intermediate and final iterations are shown in (b-d), where the non-AF and AF oracles and search models increasingly diverge. Greyscale of training points corresponds to their importance weights used for autofocusing. In (d), each star and dotted horizontal line indicate the ground-truth value corresponding to the point of maximum density, indicative of the quality of the final search model (higher is better). The values of $(\sigma_\epsilon, \sigma_0)$ used here correspond to the ones marked by an $\times$ in Figure 2, which summarizes results across a range of settings. Panels (e,f) show the search model over all iterations without and with autofocusing, respectively.
  • Figure 2: Improvement from autofocusing (AF) over a wide range of settings of the illustrative example. Each colored square shows the improvement (averaged over $50$ trials) conferred by AF for one setting, $(\sigma_\epsilon,\sigma_0)$, of, respectively, the standard deviations of the training distribution and the label noise. Improvement is quantified as the difference between the ground-truth objective in Equation \ref{eq:mbo} achieved by the final search model with and without AF. A positive value means AF yielded higher ground-truth values (i.e., performed better than without AF), while zero means it neither helped nor hurt. Similar plots to Figure 1 are shown in the Supplementary Material for other settings (Figure S1).
  • Figure S1: Examples of regimes where autofocusing (AF) sometimes yielded lower final objectives than without (non-AF). Each row shows snapshots of CbAS in a different experimental regime, from initialization (leftmost panel), to an intermediate iteration (middle panel), to the final iteration (rightmost panel). As in Figure 1, the vertical axis represents both $y$ values (for the oracle and ground truth) and probability density values (of the training distribution, $p_0(\mathbf{x})$, and search distributions, $p_{\theta^{(t)}}(\mathbf{x})$). Shaded envelopes correspond to $\pm1$ standard deviation of the oracles, $\sigma_{\beta^{(t)}}$, with the oracle expectations, $\mu_{\beta^{(t)}}(\mathbf{x})$, shown as a solid line. Greyscale of training points corresponds to their importance weights used in autofocusing. In the rightmost panels, for easy visualization of the final search models achieved with and without AF, the stars and dotted horizontal lines indicate the ground-truth values corresponding to the points of maximum density.
  • Figure S2: Training distribution and initial oracle for designing superconductors. Simulated training data were generated from a training distribution, $p_0(\mathbf{x})$, which was a multivariate Gaussian fit to data points with ground-truth expectations below the $80^\textit{th}$ percentile. The left panel shows histograms of the ground-truth expectations of these original data points, and the ground-truth expectations of simulated training data. The right panel illustrates the error of an initial oracle used in the experiments, by plotting the ground-truth and predicted labels of $10,000$ test points drawn from the training distribution. The RMSE here was $7.31$.
  • Figure S3: Designing superconducting materials. Trajectories of different MBO algorithms run without (left) and with autofocusing (right), on one example trial used to compute Table \ref{tab:stats}. At each iteration, we extract the samples with oracle expectations greater than the $80^\textit{th}$ percentile. For these samples, solid lines give the median oracle (green) and ground-truth (indigo) expectations. The shaded regions capture $70$ and $95$ percent of these quantities. The RMSE at each iteration is between the oracle and ground-truth expectations of all samples. The horizontal axis is sorted by increasing $80^\textit{th}$ percentile of oracle expectations (i.e., the samples plotted at iteration $1$ are from the iteration whose $80^\textit{th}$ percentile of oracle expectations was lowest). This ordering exposes the trend of whether the oracle expectations of samples were correlated with their ground-truth expectations. Two more algorithms are shown in Figure \ref{fig:supercon_traj2}.
  • ...and 1 more figure
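The per-iteration summary described in the Figure S3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' plotting code: the function name and return layout are hypothetical, and the $70$ and $95$ percent regions are assumed to be central percentile bands, which the caption does not state explicitly.

```python
import numpy as np


def summarize_iteration(oracle_mean, truth_mean, percentile=80.0):
    """Summarise one iteration's samples in the manner the caption describes.

    oracle_mean, truth_mean: oracle and ground-truth expectations per sample.
    """
    oracle_mean = np.asarray(oracle_mean, dtype=float)
    truth_mean = np.asarray(truth_mean, dtype=float)

    # Keep the samples whose oracle expectation exceeds the chosen percentile.
    threshold = np.percentile(oracle_mean, percentile)
    top = oracle_mean > threshold

    def bands(values):
        # Central 70% and 95% bands (an assumption about the shaded regions).
        return {
            "median": float(np.median(values)),
            "band70": tuple(np.percentile(values, [15.0, 85.0])),
            "band95": tuple(np.percentile(values, [2.5, 97.5])),
        }

    # The RMSE is computed over all samples, not only the selected ones.
    rmse = float(np.sqrt(np.mean((oracle_mean - truth_mean) ** 2)))
    return {
        "threshold": float(threshold),
        "oracle": bands(oracle_mean[top]),
        "truth": bands(truth_mean[top]),
        "rmse": rmse,
    }


def plotting_order(summaries):
    """Indices of iterations sorted by increasing oracle percentile threshold."""
    return np.argsort([s["threshold"] for s in summaries])
```

Calling `summarize_iteration` once per iteration and then reordering the results with `plotting_order` reproduces the horizontal-axis ordering the caption describes, in which the first plotted iteration is the one with the lowest $80^\textit{th}$ percentile of oracle expectations.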

Theorems & Definitions (7)

  • Proposition 1
  • Proof of Proposition 1
  • Proposition S2.1
  • proof
  • Lemma S2.1: Adaptation of Lemma 4.1 in Metelli et al. (2018)
  • Proposition S2.2
  • proof