Energy-Weighted Flow Matching for Offline Reinforcement Learning

Shiyuan Zhang, Weitong Zhang, Quanquan Gu

TL;DR

The paper addresses energy-guided generation, where the target distribution $q(\mathbf{x})$ is shaped by an energy function as $q(\mathbf{x}) \propto p(\mathbf{x}) \exp(-\beta \mathcal{E}(\mathbf{x}))$. It introduces Energy-Weighted Flow Matching (EFM) and Energy-Weighted Diffusion (ED), which learn energy-guided flows and diffusion processes directly, without auxiliary models, and come with theoretical guarantees that they reproduce the energy-guided distribution. It extends these ideas to offline reinforcement learning via Q-weighted Iterative Policy Optimization (QIPO), which combines energy-guided sampling with iterative policy refinement to improve performance on D4RL benchmarks and samples faster than some baselines while maintaining or improving effectiveness. The work provides the first exact energy-guided flow matching model and the first energy-guided diffusion model that operates without auxiliary models, with potential applications in image synthesis, molecular design, and offline RL. Overall, the framework reduces modeling complexity and offers a practical way to incorporate energy-based objectives into diffusion and flow-based generative models.

Abstract

This paper investigates energy guidance in generative modeling, where the target distribution is defined as $q(\mathbf x) \propto p(\mathbf x)\exp(-\beta\mathcal E(\mathbf x))$, with $p(\mathbf x)$ being the data distribution and $\mathcal E(\mathbf x)$ the energy function. To comply with energy guidance, existing methods often require auxiliary procedures to learn intermediate guidance during the diffusion process. To overcome this limitation, we explore energy-guided flow matching, a generalized form of the diffusion process. We introduce energy-weighted flow matching (EFM), a method that directly learns the energy-guided flow without the need for auxiliary models. Theoretical analysis shows that energy-weighted flow matching accurately captures the guided flow. Additionally, we extend this methodology to energy-weighted diffusion models and apply it to offline reinforcement learning (RL) by proposing the Q-weighted Iterative Policy Optimization (QIPO). Empirically, we demonstrate that the proposed QIPO algorithm improves performance in offline RL tasks. Notably, our algorithm is the first energy-guided diffusion model that operates independently of auxiliary models and the first exact energy-guided flow matching model in the literature.
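To make the target distribution $q(\mathbf x) \propto p(\mathbf x)\exp(-\beta\mathcal E(\mathbf x))$ concrete, the following is a minimal NumPy sketch that approximates samples from $q$ by reweighting samples from $p$ with self-normalized weights proportional to $\exp(-\beta\mathcal E)$. This is plain importance resampling, used here only to illustrate the target; it is not the paper's EFM or QIPO procedure. The function names, the Gaussian $p$, and the quadratic energy are all hypothetical choices made for the example.

```python
import numpy as np

def energy_weights(energies, beta):
    """Self-normalized weights proportional to exp(-beta * E(x_i))."""
    logits = -beta * np.asarray(energies, dtype=float)
    logits = logits - logits.max()   # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def resample_guided(samples, energies, beta, n, rng):
    """Resample p-samples so the result approximately follows q."""
    w = energy_weights(energies, beta)
    idx = rng.choice(len(samples), size=n, replace=True, p=w)
    return samples[idx]

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)          # samples from p = N(0, 1) (assumed for the demo)
energy = 0.5 * (x - 2.0) ** 2        # hypothetical energy E(x)
guided = resample_guided(x, energy, beta=1.0, n=20_000, rng=rng)
print(guided.mean(), guided.var())
```

For this particular Gaussian pair, $q$ is available in closed form as $\mathcal N(1, 1/2)$, so the printed mean and variance should land near 1 and 0.5. Larger $\beta$ concentrates the weights on low-energy samples and reduces the effective sample size, which is one reason a model that learns the guided flow directly can be attractive.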

Paper Structure

This paper contains 47 sections, 10 theorems, 44 equations, 11 figures, 5 tables, 6 algorithms.

Key Result

Theorem 3.1

Given the conditional vector field $\mathbf{u}_{t0}(\mathbf{x} | \mathbf{x}_0)$ that generates the conditional distribution $p_{t0}(\mathbf{x} | \mathbf{x}_0)$, the "marginal" vector field $\mathbf{u}_t(\mathbf{x}) = \int_{\mathbf{x}_0} p_{0t}(\mathbf{x}_0 | \mathbf{x}) \, \mathbf{u}_{t0}(\ldots)$ ... where $t \sim \lambda(t)$, $\mathbf{x}_0$ follows the data distribution $p_0(\cdot)$ and $\mathbf{x}$ ...
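The theorem concerns regressing a model onto a conditional vector field instead of the intractable marginal one. A minimal NumPy sketch of such a conditional flow matching loss is given below, under assumptions that are not taken from the paper: a linear path $\mathbf{x}_t = (1 - t)\mathbf{x}_0 + t\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal N(0, I)$, uniform $\lambda(t)$ on $[0, 1]$, and the hypothetical names `cfm_loss` and `v_theta`. The paper's own path and time weighting may differ.

```python
import numpy as np

def cfm_loss(v_theta, x0, rng):
    """Monte Carlo estimate of a conditional flow matching loss.

    Assumed setup (illustrative only): t ~ U[0, 1], eps ~ N(0, I),
    x_t = (1 - t) * x0 + t * eps. Along this path the conditional
    vector field evaluated at x_t is d x_t / d t = eps - x0.
    """
    n = x0.shape[0]
    t = rng.uniform(size=(n, 1))          # one time per sample
    eps = rng.normal(size=x0.shape)       # noise endpoint of the path
    x_t = (1.0 - t) * x0 + t * eps        # point on the conditional path
    target = eps - x0                     # conditional vector field at x_t
    pred = v_theta(x_t, t)                # model prediction v_theta(x, t)
    return np.mean(np.sum((pred - target) ** 2, axis=1))

rng = np.random.default_rng(0)
x0 = np.zeros((20_000, 2))                # toy data: a point mass at the origin

# A model that always predicts zero.
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x), x0, rng)

# For a point mass at the origin, x_t = t * eps, so x / t recovers eps exactly.
loss_exact = cfm_loss(lambda x, t: x / t, x0, rng)
print(loss_zero, loss_exact)
```

With the data fixed at the origin, the zero model's loss estimates $\mathbb{E}\|\boldsymbol{\epsilon}\|^2 = 2$ in two dimensions, while the field $\mathbf{x}/t$ drives the loss to numerical zero. An energy-weighted variant would additionally reweight each term of this average; the exact weighting is defined in the paper and is not reproduced here.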

Figures (11)

  • Figure 1: Visualization of the ground-truth distribution $p(\mathbf{x}) p^\beta(c=1 | \mathbf{x})$ with different values of $\beta$, the posterior distribution $p(\mathbf{x} | c)$ with $c \in \{0, 1\}$, and the data sampled from classifier-free diffusion and energy-weighted diffusion. The energy-weighted diffusion process demonstrates better performance when $\beta > 1$. More examples and details of these experiments are provided in Appendix \ref{app:bandit}.
  • Figure 2: The ground truth distribution of $q(\mathbf{x}) \propto p(\mathbf{x}) \exp(-\beta \mathcal{E}(\mathbf{x})) = p(\mathbf{x}) p^\beta(c = 1 | \mathbf{x})$ with different $\beta$, the posterior distribution $p(\mathbf{x} | c = 0)$ (Negative) and $p(\mathbf{x} | c = 1)$ (Positive) over 6 data distributions.
  • Figure 3: The data sampled from the score function trained by energy-weighted diffusion and the score function composed by classifier-free guidance with different guidance scales $\beta$. We sample 2000 data points for each state.
  • Figure 4: Comparison of the time complexity of the action-selection function for CEP, QIPO-Diff and QIPO-OT across various tasks. The results are averaged over 3 individual runs.
  • Figure 5: Comparison of normalized cumulative rewards for different support set sizes ($M = 16$, $M = 32$, and $M = 64$) across various tasks. The results are averaged over 3 individual runs.
  • ...and 6 more figures
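
The identity $p(\mathbf{x}) \exp(-\beta \mathcal{E}(\mathbf{x})) = p(\mathbf{x}) p^\beta(c = 1 | \mathbf{x})$ in the captions of Figures 1 and 2 can be unpacked as follows. This short derivation assumes the energy in that experiment is the negative log-probability of the positive class, which the captions imply but this listing does not state explicitly.

```latex
\begin{align*}
\mathcal{E}(\mathbf{x}) &= -\log p(c = 1 \mid \mathbf{x}) \\
q(\mathbf{x}) &\propto p(\mathbf{x}) \exp(-\beta \mathcal{E}(\mathbf{x}))
  = p(\mathbf{x})\, p^{\beta}(c = 1 \mid \mathbf{x}) \\
\beta = 1: \quad q(\mathbf{x}) &\propto p(\mathbf{x})\, p(c = 1 \mid \mathbf{x})
  = p(\mathbf{x}, c = 1) \propto p(\mathbf{x} \mid c = 1)
\end{align*}
```

Under that assumption, $\beta = 1$ recovers the positive-class posterior by Bayes' rule, and $\beta > 1$ sharpens the target toward regions where the classifier is most confident.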

Theorems & Definitions (15)

  • Theorem 3.1: Theorem 1, 2; lipman2022flow
  • Lemma 3.2: Lemma 1, zheng2023guided
  • Theorem 3.3: Theorem 3.1, lu2023contrastive
  • Theorem 4.1
  • Remark 4.2
  • Theorem 4.3
  • Remark 4.4: Regarding the weighted energy guided loss $\mathcal{L}_{\text{EFM}}$
  • Remark 4.5: Regarding the conditional weighted energy guided loss $\mathcal{L}_{\text{CEFM}}$
  • Remark 4.6: Connection with the importance sampling
  • Corollary 4.7
  • ...and 5 more