Autofocused oracles for model-based design
Clara Fannjiang, Jennifer Listgarten
TL;DR
The paper addresses how to design objects with desired properties when the data-driven proxies (oracles) used to score candidates may be unreliable outside their training distribution. It frames model-based design (MBD) as gradient-free optimization over a distribution of designs (the search model) and introduces autofocusing, a strategy that retrains the oracle in lockstep with the evolving search model to minimize the oracle gap, the discrepancy between the ground-truth objective and its oracle-based estimate. An alternating ascent-descent algorithm is proposed that seeks a Nash equilibrium between the search model and the oracle, with theoretical insights on variance control and covariate shift. Empirically, autofocusing improves design performance on toy examples and on a large-scale superconductor dataset, often yielding higher ground-truth objectives and better agreement between oracle predictions and true outcomes. The work provides practical guidance for integrating adaptive oracle retraining into diverse MBD frameworks, with potential extensions to uncertainty estimation and model selection.
Abstract
Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a therapeutic target, or a superconducting material with a higher critical temperature than previously observed. To that end, costly experimental measurements are being replaced with calls to high-capacity regression models trained on labeled data, which can be leveraged in an in silico search for design candidates. However, the design goal necessitates moving into regions of the design space beyond where such models were trained. Therefore, one can ask: should the regression model be altered as the design algorithm explores the design space, in the absence of new data? Herein, we answer this question in the affirmative. In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) develop a principled strategy for retraining the regression model as the design algorithm proceeds, which we refer to as autofocusing, and (iii) demonstrate the promise of autofocusing empirically.
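To make the idea of retraining the regression model as the design algorithm proceeds more concrete, the following is a toy sketch only, not the authors' implementation. It assumes a one-dimensional design space, a Gaussian search model updated by refitting to the top-scoring samples (a simple stand-in for the design algorithm), a polynomial ridge-regression oracle, and importance weighting as the way the oracle is refocused on the region the search model currently occupies. All function names and parameters here are hypothetical and chosen for illustration.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian, evaluated elementwise."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def importance_weights(x_train, train_mean, train_std, search_mean, search_std):
    """Assumed reweighting: search density over training density, scaled to mean 1."""
    log_w = (gaussian_logpdf(x_train, search_mean, search_std)
             - gaussian_logpdf(x_train, train_mean, train_std))
    log_w -= log_w.max()  # stabilize: the largest weight becomes exactly 1
    w = np.exp(log_w)
    return w / w.mean()


def effective_sample_size(weights):
    """Diagnostic for weight variance: n for uniform weights, near 1 if one dominates."""
    return weights.sum() ** 2 / (weights ** 2).sum()


def fit_weighted_oracle(x_train, y_train, weights, degree=3, ridge=1e-6):
    """Weighted polynomial ridge regression; returns a prediction function."""
    X = np.vander(x_train, degree + 1)
    A = X.T @ (weights[:, None] * X) + ridge * np.eye(degree + 1)
    b = X.T @ (weights * y_train)
    coef = np.linalg.solve(A, b)
    return lambda x: np.vander(np.atleast_1d(x), degree + 1) @ coef


def autofocus_sketch(x_train, y_train, n_rounds=5, n_samples=200, top_frac=0.2, seed=0):
    """Alternate between updating the search model and retraining the oracle."""
    rng = np.random.default_rng(seed)
    train_mean, train_std = x_train.mean(), x_train.std()
    search_mean, search_std = train_mean, train_std
    oracle = fit_weighted_oracle(x_train, y_train, np.ones_like(x_train))
    history = []
    for _ in range(n_rounds):
        # Search-model step: sample designs, keep the best under the current oracle.
        samples = rng.normal(search_mean, search_std, n_samples)
        scores = oracle(samples)
        n_top = max(2, int(top_frac * n_samples))
        elite = samples[np.argsort(scores)[-n_top:]]
        search_mean = elite.mean()
        search_std = max(elite.std(), 0.1 * train_std)  # floor avoids collapse
        # Oracle step: refit on the same labeled data, reweighted toward the search model.
        w = importance_weights(x_train, train_mean, train_std, search_mean, search_std)
        oracle = fit_weighted_oracle(x_train, y_train, w)
        history.append((search_mean, search_std, effective_sample_size(w)))
    return oracle, history
```

No new labels are collected in this loop; only the weights on the existing training data change. The effective sample size recorded each round is one simple way to watch the variance of the reweighted fit as the search model drifts away from the training distribution.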
