Exploring validation metrics for offline model-based optimisation with diffusion models

Christopher Beckham; Alexandre Piche; David Vazquez; Christopher Pal

TL;DR

This paper tackles offline MBO where the ground-truth oracle $f$ is expensive to evaluate, and proposes a framework to identify cheap validation metrics that best correlate with $f$ across simulated datasets. It focuses on diffusion models, presenting two conditioning schemes (classifier-based and classifier-free), and introduces an evaluation setup that extrapolates the model in order to measure how well it generalizes beyond the rewards observed in training. By applying five validation metrics to four Design Bench datasets, the study finds that Agreement, Fréchet Distance, and a reward-based metric typically correlate most strongly with the ground-truth reward, and that hyperparameters, especially the classifier guidance weight, have a large impact on performance. The framework offers practical guidance for selecting validation metrics and tuning diffusion-based MBO systems, providing a bridge toward safer and more economical offline-to-online optimization in real-world tasks.

Abstract

In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle, which is expensive to compute since it involves executing a real world process. In offline MBO we wish to do so without assuming access to such an oracle during training or validation, which makes evaluation non-straightforward. While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples. Measuring the mean reward of generated candidates over this approximation is one such 'validation metric', whereas we are interested in a more fundamental question: which validation metrics correlate the most with the ground truth? Answering it involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. This is encapsulated under our proposed evaluation framework, which is also designed to measure extrapolation, the ultimate goal behind leveraging generative models for MBO. While our evaluation framework is model agnostic, we specifically evaluate denoising diffusion models due to their state-of-the-art performance, and we derive insights such as a ranking of the most effective validation metrics and a discussion of important hyperparameters.
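The core question in the abstract, which validation metric tracks the ground truth best, can be illustrated with a minimal sketch. It assumes that each training run yields one value per validation metric plus one test reward from the ground truth oracle, and that metrics are designed to be minimised (as the Figure 4 caption states), so the most negative Pearson correlation is best. The function names, metric names, and numbers below are hypothetical and are not taken from the paper.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

def rank_metrics(metric_values, test_reward):
    """Sort metrics from most negatively to most positively correlated.

    metric_values: dict mapping a metric name to one value per run.
    test_reward:   ground truth reward per run (to be maximised).
    """
    corrs = {name: pearson(vals, test_reward)
             for name, vals in metric_values.items()}
    return sorted(corrs.items(), key=lambda kv: kv[1])

# Hypothetical values for four runs (illustration only, not paper results).
test_reward = [0.2, 0.4, 0.6, 0.8]
metric_values = {
    "metric_a": [4.0, 3.0, 2.0, 1.0],  # falls as reward rises
    "metric_b": [1.0, 3.0, 2.0, 4.0],  # tends to rise with reward
}
ranking = rank_metrics(metric_values, test_reward)
```

Here `ranking[0]` is the metric whose low values most reliably indicate high ground truth reward under this simplified setup.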

Paper Structure (55 sections, 45 equations, 14 figures, 5 tables, 2 algorithms)

Introduction
Contributions
Motivation and proposed framework
Training and generation
Conditioning
Classifier-based guidance
Classifier-free guidance
Extrapolation
Model selection
Final evaluation
Related work
Design Bench
Validation metrics
Use of validation set
Bayesian optimisation
...and 40 more sections

Figures (14)

Figure 1: We want to produce designs $\bm{x}$ that have high reward according to the ground truth oracle $y = {\color{DarkOrange}{f}}(\bm{x})$, but this is usually prohibitively expensive to compute since it involves executing a real-world process. If we instead considered datasets where the ground truth oracle is cheap to compute (for instance simulations), we can search for cheap-to-compute validation metrics that correlate well with the ground truth. In principle, this can facilitate faster and more economical generation of novel designs for real-world tasks where the ground truth oracle is expensive to compute.
Figure 2: A visualisation of our evaluation framework. Here, we assume joint generative models of the form ${\color{DarkBlue}{p_{\theta}}}(\bm{x},y)$. Models are trained on ${\color{DarkBlue}{\mathcal{D}_{\text{train}}}}$ as per Section \ref{['sec:training']}, and in this paper we assume the use of conditional denoising diffusion probabilistic models (DDPMs). For this class of model the joint distribution $p_{\theta}(\bm{x},y)$ decomposes into ${\color{DarkBlue}{p_{\theta}}}(\bm{x}|y)p(y)$, and the way we condition the model on $y$ is described in Section \ref{['sec:conditioning']}. In order to generate samples conditioned on rewards $y$ larger than what was observed in the training set, we must switch the prior distribution of the model, which corresponds to 'extrapolating' it and is described in Section \ref{['sec:extrapolation']}. Validation is done periodically during training and the best weights are saved for each validation metric considered. The precise details of this are described in Algorithm \ref{['alg:training']}. When the best models have been found we perform a final evaluation on the real ground truth oracle, and this process is described in Algorithm \ref{['alg:final_eval']}.
Figure 3: \ref{['fig:mbo_data_splits_db']}: Design Bench only prescribes a training split, which retains only examples whose $y$ is less than or equal to a threshold $\gamma$. The full dataset, while technically accessible, is not meant to be accessed for model selection as per the intended use of the framework. While the training set could be subsampled to give an 'inner' training set and validation set, the validation set would still come from the same distribution as training, which means we cannot effectively measure how well a generative model extrapolates. To address this, we retain the training set but denote everything else (examples whose rewards are $> \gamma$) to be the validation set (\ref{['fig:mbo_data_splits_mine_case1']}), and the validation oracle ${\color{DarkGreen}{f_{\phi}}}$ is trained on ${\color{DarkBlue}{\mathcal{D}_{\text{train}}}} \cup {\color{DarkGreen}{\mathcal{D}_{\text{valid}}}}$. No test set needs to be created since the ground truth oracle ${\color{DarkOrange}{f}}$ is the 'test set'. However, if the ground truth oracle does not exist because the MBO dataset is not exact, we need to also prescribe a test set (\ref{['fig:mbo_data_splits_mine_case2']}). Since there is no ground truth oracle ${\color{DarkOrange}{f}}$, we must train a 'test oracle' ${\color{DarkOrange}{\tilde{f}}}$ on ${\color{DarkBlue}{\mathcal{D}_{\text{train}}}} \cup {\color{DarkGreen}{\mathcal{D}_{\text{valid}}}} \cup {\color{DarkOrange}{\mathcal{D}_{\text{test}}}}$ (i.e. the full dataset). Note that this remains compatible with the test oracles prescribed by Design Bench, since they are also trained on the full data. Furthermore, our training sets remain identical to theirs.
Figure 4: The Pearson correlation computed for each dataset / diffusion variant. Pearson correlations are computed as per the description in Paragraph \ref{['para:experiments']}. Since each validation metric is designed to be minimised, the ideal metric should be highly negatively correlated with the test reward (Equation \ref{['eq:test_score']}), which is to be maximised. By counting the best metric per experiment, we obtain the following counts (the more ticks the better): $-\mathcal{M}_{reward}$: ✓, $\mathcal{M}_{FD}$: ✓✓, $\mathcal{M}_{Agr}$: ✓✓✓, $\mathcal{M}_{C-DSM}$: ✓.
Figure 5: Correlation plots for each dataset using the classifier-free guidance (c.f.g.) diffusion variant. Each point is colour-coded by $w$, which specifies the strength of the 'implicit' classifier that is derived (Equation \ref{['eq:cfg']}). We can see that $w$ makes a discernible difference with respect to most of the plots shown. For additional plots for other datasets, please see Section \ref{['sec:more_plots']}.
...and 9 more figures
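
The data split described in the Figure 3 caption, for the case where an exact ground truth oracle exists, can be sketched as follows. This is a simplified illustration under stated assumptions: designs and rewards are held in NumPy arrays, and the function name `split_by_threshold` and the example numbers are hypothetical rather than part of Design Bench or the paper's code.

```python
import numpy as np

def split_by_threshold(x, y, gamma):
    """Split a dataset into training and validation sets by reward.

    Examples with reward y <= gamma form the training set; examples with
    y > gamma form the validation set, so validation rewards lie beyond
    anything seen in training.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    train_mask = y <= gamma
    train = (x[train_mask], y[train_mask])
    valid = (x[~train_mask], y[~train_mask])
    return train, valid

# Hypothetical toy data: five 2-dimensional designs and their rewards.
x = np.arange(10).reshape(5, 2)
y = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
(train_x, train_y), (valid_x, valid_y) = split_by_threshold(x, y, gamma=0.5)
```

Because the validation set contains only rewards above $\gamma$, an oracle trained on the union of both sets has seen the high-reward region that the generative model itself was never trained on, which is what lets validation probe extrapolation.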
