Temporal Difference Flows
Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati
TL;DR
This work introduces Temporal Difference Flows (TD-Flow), a method for learning Geometric Horizon Models (GHMs) that predict future states directly over long horizons and so avoid the error accumulation of stepwise world-model unrolls. TD-Flow formulates a Bellman equation on probability paths and learns it with flow matching; its variants (TD-CFM, TD$^2$-CFM, and diffusion-based counterparts) give stable, high-fidelity long-horizon predictions with reduced gradient variance. Theoretical results establish contraction properties of the probability-path operators and convergence to the successor measure $m^\pi$. Empirical evaluations across 22 tasks show substantial gains in both generative metrics and policy evaluation, and integrating TD-Flow with pre-trained behavior models improves long-horizon planning via Generalized Policy Improvement.
Abstract
Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.
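
To make the object a GHM targets concrete, here is a minimal tabular sketch of the successor measure, the discounted distribution over future states, computed in two ways: by unrolling a one-step model and by iterating the Bellman fixed point $m = (1-\gamma)P + \gamma P m$ that TD-style methods bootstrap from. It is an illustration only and not the paper's method, which learns a generative model with flow matching. The 3-state chain, its transition matrix `P` (with the policy assumed to be already folded in), the discount `GAMMA`, and all function names are hypothetical.

```python
# Tabular illustration of the successor measure behind geometric horizon models.
# Assumption: a made-up 3-state Markov chain; no number here comes from the paper.

GAMMA = 0.9

# Row-stochastic transition matrix, P[i][j] = Pr(next state = j | state = i).
P = [
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
    [0.5, 0.0, 0.5],
]


def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def successor_measure_by_unrolling(P, gamma, horizon):
    """Truncated sum (1 - gamma) * sum_{k >= 0} gamma^k * P^(k+1)."""
    n = len(P)
    m = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]  # starts at P^1
    weight = 1.0 - gamma
    for _ in range(horizon):
        for i in range(n):
            for j in range(n):
                m[i][j] += weight * power[i][j]
        power = matmul(power, P)   # one more step of the one-step model
        weight *= gamma
    return m


def successor_measure_by_bellman(P, gamma, iterations):
    """Fixed-point iteration m <- (1 - gamma) * P + gamma * P @ m."""
    n = len(P)
    m = [[1.0 / n] * n for _ in range(n)]  # arbitrary initial guess
    for _ in range(iterations):
        pm = matmul(P, m)                  # bootstrap from the current estimate
        m = [[(1.0 - gamma) * P[i][j] + gamma * pm[i][j] for j in range(n)]
             for i in range(n)]
    return m


m_unroll = successor_measure_by_unrolling(P, GAMMA, horizon=500)
m_bellman = successor_measure_by_bellman(P, GAMMA, iterations=500)
max_gap = max(abs(m_unroll[i][j] - m_bellman[i][j])
              for i in range(3) for j in range(3))
```

Both routes agree up to numerical precision on this toy chain, and each row of the result is a probability distribution over future states. In the exact tabular case nothing is lost by unrolling; the compounding errors described in the abstract arise when the one-step model is learned and therefore imperfect, which is the setting where predicting the discounted future directly is attractive.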
