Temporal Difference Flows

Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati

TL;DR

This work introduces Temporal Difference Flows (TD-Flow) to learn Geometric Horizon Models (GHMs) for long-horizon state prediction, addressing error accumulation from stepwise world-model unrolls. By formulating a Bellman equation on probability paths and employing flow matching, TD-Flow (including TD-CFM, TD$^2$-CFM, and diffusion variants) achieves stable, high-fidelity long-horizon predictions with reduced gradient variance. Theoretical results establish contraction properties of the probability-path operators and convergence to the successor measure $m^\pi$, while empirical evaluations across 22 tasks show substantial gains in both generative metrics and policy evaluation, with notable improvements in planning via Generalized Policy Improvement. Extensions to diffusion-based methods further broaden applicability, and experiments demonstrate robust long-horizon planning capabilities when integrating TD-Flow with pre-trained behavior models. Overall, TD-Flow provides a principled, scalable approach to long-horizon predictive modeling and planning in reinforcement learning.

Abstract

Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.
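To make the "geometric horizon" concrete, here is a minimal Python sketch, not the paper's method. It assumes the common convention that a GHM targets the normalized discounted distribution of future states, so a sample is the state reached after $K \geq 1$ steps with $P(K = k) = (1-\gamma)\gamma^{k-1}$. The function names and the clipping of $K$ at the end of a finite rollout are illustrative assumptions. The mean of $K$ is the effective horizon $(1-\gamma)^{-1}$.

```python
import random


def effective_horizon(gamma: float) -> float:
    """Mean look-ahead (1 - gamma)^-1 of a geometric horizon with discount gamma."""
    return 1.0 / (1.0 - gamma)


def sample_horizon(gamma: float, rng: random.Random) -> int:
    """Draw K >= 1 with P(K = k) = (1 - gamma) * gamma ** (k - 1)."""
    k = 1
    # Keep stepping forward with probability gamma, stop with probability 1 - gamma.
    while rng.random() < gamma:
        k += 1
    return k


def sample_future_state(trajectory, gamma: float, rng: random.Random):
    """Monte Carlo sample of a discounted future state from one recorded rollout.

    trajectory[0] is the current state. Assumption: K is clipped to the last
    index of the rollout, which biases samples for short trajectories.
    """
    k = min(sample_horizon(gamma, rng), len(trajectory) - 1)
    return trajectory[k]


if __name__ == "__main__":
    rng = random.Random(0)
    for gamma in (0.8, 0.9, 0.95, 0.98, 0.99):
        draws = [sample_horizon(gamma, rng) for _ in range(20_000)]
        empirical = sum(draws) / len(draws)
        print(f"gamma={gamma}: effective horizon {effective_horizon(gamma):.1f}, "
              f"empirical mean of K {empirical:.1f}")
```

A GHM replaces this rollout-based sampling with a learned generative model, so that a single model call produces a sample at a geometrically distributed look-ahead instead of requiring $K$ sequential environment or world-model steps.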

Paper Structure

This paper contains 31 sections, 28 theorems, 118 equations, 8 figures, 9 tables, 1 algorithm.

Key Result

Proposition 1

Given a conditional probability path $p_{t\mid Z}$ and vector field $u_{t\mid Z}$ with their associated marginal counterparts $p_t(x)$ and $v_t(x)$, we have

Figures (8)

  • Figure 1: Visual depiction of TD-Flow variants. Samples are mapped from $m_0$ to the target distribution $m_1^{(n)}$ through the neural ODE $\psi^{(n)}_t$. Dashed lines depict the neural ODE trajectory; solid lines show the conditional probability path $u_t$. (Left) td-cfm maps $X_0$ to $X_1$ before creating a separate conditional path between $X'_0$ and $X_1$, resulting in crossing paths. (Middle) td-cfm(c) directly couples $X_0$ used to generate $X_1$ when constructing the conditional probability path. (Right) td${}^2$-cfm solves the neural ODE up to time $t$ to directly obtain the target velocity $\tilde{v}_t$.
  • Figure 2: Value-function prediction error as a function of the effective horizon $(1-\gamma)^{-1}$ for $\gamma \in \{0.8, 0.9, 0.95, 0.98, 0.99\}$ on the Pointmass loop task. td${}^2$ methods show impressive robustness to increasingly long-horizon predictions.
  • Figure 2: Template for TD-Flow algorithms
  • Figure 3: Evaluation results comparing our td-based methods along with gan and vae baselines for a single policy. Results are computed over $19$ tasks from $4$ domains and further averaged across $3$ seeds. For each metric we highlight the best-performing methods.
  • Figure 4: Performance difference between td-cfm(c) and td${}^2$-cfm for curved and straight conditional paths. Lower is better, with negative values indicating a net improvement from employing curved paths.
  • ...and 3 more figures

Theorems & Definitions (49)

  • Proposition 1: lipman2024flow
  • Lemma 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Proposition 2: vincent2011connection
  • Lemma 2
  • proof
  • Lemma 2
  • ...and 39 more