Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models

Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani

TL;DR

This work tackles offline design optimization by marrying generative diffusion models with offline reward learning. The authors introduce BRAID, a doubly conservative fine-tuning framework that learns a conservative reward model from offline data and updates a pre-trained diffusion model with KL penalties to stay within the valid design space and suppress over-optimization. They provide a regret-style theoretical guarantee for the soft-entropy regularized objective and demonstrate, across DNA/RNA sequence design and image generation tasks, that BRAID can surpass the best designs in the offline data while avoiding invalid outputs. The approach leverages uncertainty-aware penalties and pre-trained generators to exploit reward-model extrapolation safely, offering a principled path to robust offline diffusion-based design. The results suggest practical impact for scientific design problems where offline data is abundant but online reward feedback is scarce or expensive.
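
The two layers of conservatism described above can be summarized schematically. The notation here is an assumption made for illustration and may differ from the paper's: $\hat{r}$ is the reward model learned from offline data, $\hat{g} \ge 0$ is an uncertainty penalty that grows away from the offline data, $p_{\mathrm{pre}}$ is the pre-trained diffusion model, and $\alpha > 0$ is the KL weight.

$$
\hat{p}_{\alpha} \in \arg\max_{p}\; \mathbb{E}_{x \sim p}\big[\hat{r}(x) - \hat{g}(x)\big] \;-\; \alpha\, \mathrm{KL}\big(p \,\|\, p_{\mathrm{pre}}\big)
$$

If the maximization ranges over all distributions, the standard solution of a KL-regularized objective of this form is an exponential tilting of the pre-trained model:

$$
\hat{p}_{\alpha}(x) \;\propto\; p_{\mathrm{pre}}(x)\, \exp\!\Big(\frac{\hat{r}(x) - \hat{g}(x)}{\alpha}\Big)
$$

Under this reading, the penalty $\hat{g}$ lowers the tilt in regions with little offline data, while the factor $p_{\mathrm{pre}}(x)$ keeps probability mass off designs the pre-trained model considers invalid.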

Abstract

AI-driven design problems, such as DNA/protein sequence design, are commonly tackled from two angles: generative modeling, which efficiently captures the feasible design space (e.g., natural images or biological sequences), and model-based optimization, which utilizes reward models for extrapolation. To combine the strengths of both approaches, we adopt a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL. Although prior work has explored similar avenues, it primarily focuses on scenarios where accurate reward models are accessible. In contrast, we concentrate on an offline setting where the reward model is unknown and must be learned from static offline datasets, a common scenario in scientific domains. In offline scenarios, existing approaches tend to suffer from overoptimization, as they may be misled by the reward model in out-of-distribution regions. To address this, we introduce a conservative fine-tuning approach, BRAID, which optimizes a conservative reward model that includes additional penalization outside of offline data distributions. Through empirical and theoretical analysis, we demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models while avoiding the generation of invalid designs through pre-trained diffusion models.
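
As a rough illustration of what a conservative reward model might look like, the sketch below fits a bootstrap ensemble on offline data and subtracts the ensemble's disagreement from its mean prediction. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear ridge regressors, the function names `fit_bootstrap_ensemble` and `conservative_reward`, the `penalty` coefficient, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def fit_bootstrap_ensemble(X, y, n_models=20, ridge=1e-2, seed=0):
    """Fit ridge regressors on bootstrap resamples of the offline data.

    Hypothetical helper: a linear model stands in for whatever reward
    network would be used in practice.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        Xb, yb = X[idx], y[idx]
        w = np.linalg.solve(Xb.T @ Xb + ridge * np.eye(d), Xb.T @ yb)
        weights.append(w)
    return np.stack(weights)  # shape (n_models, d)

def conservative_reward(weights, X_query, penalty=1.0):
    """Ensemble mean minus `penalty` times ensemble standard deviation."""
    preds = X_query @ weights.T  # shape (n_query, n_models)
    return preds.mean(axis=1) - penalty * preds.std(axis=1)

# Synthetic offline dataset (assumed for illustration only).
rng = np.random.default_rng(1)
X_off = rng.normal(size=(200, 3))
y_off = X_off @ np.array([1.0, -0.5, 0.2]) + 0.1 * rng.normal(size=200)

W = fit_bootstrap_ensemble(X_off, y_off)

# One query near the offline data and one far outside it.
X_q = np.array([[0.1, 0.1, 0.1], [10.0, 10.0, 10.0]])
mean_q = (X_q @ W.T).mean(axis=1)
cons_q = conservative_reward(W, X_q, penalty=2.0)
gap = mean_q - cons_q  # size of the penalty applied to each query
print("penalty near data:", gap[0], "penalty far from data:", gap[1])
```

In this toy setup the penalty is larger for the query far from the offline data, which is the qualitative behavior the abstract describes: the optimizer gains less from extrapolating into regions where the learned reward is poorly supported.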

Paper Structure

This paper contains 61 sections, 4 theorems, 45 equations, 8 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

Let $\hat{p}_{\alpha}(\cdot)$ be the distribution induced by the optimal policies $\{\hat{p}_t\}_{t=T+1}^{1}$ in eq:key_plnanning, i.e., $\hat{p}_{\alpha}(x_0) = \int \{\prod_{t=T+1}^{1} \hat{p}_t(x_{t-1} \mid x_t)\}\, dx_{1:T}$, when $\{\Pi_t\}$ is a global policy class ($\Pi_t = \{\mathcal{X} \to \Delta(\mathcal{X})\}$).

Figures (8)

  • Figure 1: The left figure illustrates our setup with a pre-trained generative model and offline data. On the right, the motivation of the algorithm is depicted. The region surrounded by the green line is the original entire design space, with the colored region indicating the valid design space (e.g., natural images, human-like DNA sequences). The red region denotes areas with more offline data available, while the blue region indicates areas with less data available. We aim to add penalties to the blue regions using conservative reward modeling to prevent overoptimization while imposing a stricter KL penalty on the non-colored regions to prevent the generation of invalid designs.
  • Figure 2: Bar plots of the rewards $r(x)$ for samples generated by each algorithm. The proposed methods consistently outperform the baselines.
  • Figure 3: (c) Generated images
  • Figure 4: UTRs
  • Figure 5: Enhancers
  • ...and 3 more figures

Theorems & Definitions (12)

  • Example 1: Gaussian processes.
  • Example 2: Bootstrap
  • Theorem 1
  • Remark 1: Novelty of thm:key
  • Theorem 2: Per-step regret
  • Example 3
  • Corollary 1: Informal (the formal characterization is in sec:GPs)
  • Remark 2
  • Theorem 3: Marginal and Posterior distributions
  • Definition 1
  • ...and 2 more