Bellman Diffusion Models

Liam Schramm; Abdeslam Boularias

TL;DR

This paper introduces Bellman Diffusion Models (BDM) as off-policy, generative estimators of the successor state measure, enforcing Bellman-flow constraints to yield a simple Bellman update on diffusion step distributions. It derives a TD-like update for diffusion models and provides KL-based bounds to connect diffusion predictions with Bellman consistency, enabling a practical, low-variance objective. The authors propose TD3-SBC, an offline RL algorithm that regularizes both actions and future states via a Bellman diffusion term, built on ReBRAC, and show state-of-the-art results on D4RL with improved stability. The approach bridges state-occupancy perspectives and practical RL by enabling direct regularization of the SSM, reducing distribution shift and broadening the applicability of diffusion-based policies in offline settings.

Abstract

Diffusion models have seen tremendous success as generative architectures. Recently, they have been shown to be effective at modelling policies for offline reinforcement learning and imitation learning. We explore using diffusion as a model class for the successor state measure (SSM) of a policy. We find that enforcing the Bellman flow constraints leads to a simple Bellman update on the diffusion step distribution.
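To make the Bellman flow constraint concrete, here is a minimal tabular sketch. It is not the paper's diffusion-based method. It assumes a finite state space, a fixed policy folded into a state-to-state transition matrix `P`, and the normalized convention $M = (1-\gamma)\sum_{t \ge 0} \gamma^t P^{t+1}$; other normalizations appear in the literature. Under these assumptions the successor state measure satisfies $M = (1-\gamma)P + \gamma P M$, and iterating that update converges to the closed form. The function names and the random example MDP are hypothetical.

```python
import numpy as np


def successor_measure_closed_form(P, gamma):
    """Closed form M = (1 - gamma) * P @ (I - gamma * P)^{-1} (assumed normalization)."""
    n = P.shape[0]
    return (1.0 - gamma) * P @ np.linalg.inv(np.eye(n) - gamma * P)


def successor_measure_bellman(P, gamma, iters=500):
    """Fixed-point iteration of the Bellman flow update M <- (1 - gamma) P + gamma P M."""
    n = P.shape[0]
    M = np.full_like(P, 1.0 / n)  # arbitrary initial guess: uniform rows
    for _ in range(iters):
        M = (1.0 - gamma) * P + gamma * P @ M
    return M


# Hypothetical 4-state chain with a random row-stochastic transition matrix.
rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)
gamma = 0.9

M_closed = successor_measure_closed_form(P, gamma)
M_iter = successor_measure_bellman(P, gamma)
```

Each row of `M_iter` is a probability distribution over future states. The update mixes the one-step transition (weight $1-\gamma$) with the bootstrapped measure at the next state (weight $\gamma$), which is the same structure a TD-style update on a generative model of the SSM would target.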

Paper Structure (20 sections, 4 theorems, 41 equations, 4 tables, 3 algorithms)

Introduction
Background
Related work
Notation and Definitions
Derivation
KL Divergence between Diffusion Models
Bounding the Bellman Flow Divergence
TD Update
Algorithm
Theoretical analysis
Offline Reinforcement Learning Algorithm
Experiments
Conclusion
Proofs mentioned in the main text
KL divergence between diffusion models
...and 5 more sections

Key Result

Lemma 1

Let $q$ and $p$ be $K$-step diffusion models with noise schedule $\beta_i$, parameterized by neural networks with outputs $\epsilon_q$ and $\epsilon_p$, respectively. Let $q_i$ and $p_i$ be the distribution of the samples generated by the first $K-i$ steps of the forward process of $q$ and $p$, resp

Theorems & Definitions (7)

Lemma 1
Proposition 1
Proposition 2
Corollary 1
proof
proof
proof
