Consistent World Models via Foresight Diffusion

Yu Zhang, Xingzhuo Guo, Haoran Xu, Mingsheng Long

TL;DR

This work addresses the mismatch between diffusion-based world models and the need for trajectory-consistent predictions. It introduces Foresight Diffusion (ForeDiff), a two-stream architecture that decouples condition understanding from denoising and leverages a pretrained deterministic predictor to guide generation. Through a two-stage training scheme and PredHead-assisted conditioning, ForeDiff achieves higher predictive accuracy and substantially improved sample consistency on robot video prediction and scientific spatiotemporal forecasting. The results demonstrate a promising direction for reliable diffusion-based world models with improved control and stability in real-world and physical-domain tasks.

Abstract

Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in world modeling. However, unlike typical generation tasks that encourage sample diversity, world models entail different sources of uncertainty and require consistent samples aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning consistent diffusion-based world models lies in the suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a diffusion-based world modeling framework that enhances consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sample consistency over strong baselines, offering a promising direction for diffusion-based world models.
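The decoupling described in the abstract can be illustrated with a toy, purely linear sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the linear predictor and denoiser, the interpolation path `x_t = (1 - t) * x0 + t * eps`, the velocity target `eps - x0`, and the use of the predictor's output as the conditioning representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): condition y and target x0, both flattened vectors.
n, d_y, d_x = 512, 8, 4
W_true = rng.normal(size=(d_y, d_x))
y = rng.normal(size=(n, d_y))
x0 = y @ W_true + 0.05 * rng.normal(size=(n, d_x))

# Stage 1: fit a deterministic predictor on (y, x0) only; no noisy target is involved.
W_pred, *_ = np.linalg.lstsq(y, x0, rcond=None)


def predictive_stream(cond):
    """Frozen after stage 1: maps the condition to a representation."""
    return cond @ W_pred


# Stage 2: fit the denoising stream on (x_t, t, frozen representation).
t = rng.uniform(size=(n, 1))
eps = rng.normal(size=(n, d_x))
x_t = (1 - t) * x0 + t * eps      # assumed interpolation path
v_target = eps - x0               # assumed velocity target
h = predictive_stream(y)          # W_pred stays fixed during this stage
features = np.concatenate([x_t, t, h], axis=1)
W_den, *_ = np.linalg.lstsq(features, v_target, rcond=None)

stage1_mse = float(np.mean((predictive_stream(y) - x0) ** 2))
print("stage-1 predictor MSE:", stage1_mse)
```

The point of the sketch is the data flow: the predictor never sees `x_t` or `t`, and the denoiser only sees the condition through the frozen predictor's output.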

Paper Structure

This paper contains 41 sections, 1 theorem, 20 equations, 7 figures, 8 tables.

Key Result

Lemma 3.1

For a diffusion model as defined in the preliminaries section, by reparameterizing the output as $\hat{\mathbf{x}}_\theta(\mathbf{x}_t, t, \mathbf{y}) = \mathbf{x}_t - t \cdot \mathbf{v}_\theta(\mathbf{x}_t, t, \mathbf{y})$ (as in EDM; Karras et al., 2022), the training objective at $t = 1$ simplifies to: where the noise $\boldsymbol{\epsilon}$ is independent of both $\mathbf{x}_0$ and $\mathbf{y}$. Furthermore,
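A short worked derivation shows why a simplification of this kind is plausible. The interpolation $\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon}$ and the velocity target $\boldsymbol{\epsilon} - \mathbf{x}_0$ used below are assumptions, since the preliminaries are not reproduced here; this is a sketch under those assumptions, not the lemma's own statement.

```latex
\begin{align*}
\mathbf{x}_1 &= \boldsymbol{\epsilon}, \qquad
\hat{\mathbf{x}}_\theta(\boldsymbol{\epsilon}, 1, \mathbf{y})
  = \boldsymbol{\epsilon} - \mathbf{v}_\theta(\boldsymbol{\epsilon}, 1, \mathbf{y}) \\
\left\| \mathbf{v}_\theta(\boldsymbol{\epsilon}, 1, \mathbf{y})
  - (\boldsymbol{\epsilon} - \mathbf{x}_0) \right\|^2
&= \left\| \mathbf{x}_0
  - \hat{\mathbf{x}}_\theta(\boldsymbol{\epsilon}, 1, \mathbf{y}) \right\|^2
\end{align*}
```

Under these assumptions the $t = 1$ objective is a plain regression of $\mathbf{x}_0$ from $(\boldsymbol{\epsilon}, \mathbf{y})$. If $\boldsymbol{\epsilon}$ is independent of the pair $(\mathbf{x}_0, \mathbf{y})$, the input noise carries no information about the target, so the mean-squared-error minimizer is $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{y}]$.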

Figures (7)

  • Figure 1: Aligning model stochasticity with task demands. (Left) Generation tasks favor diversity, making diffusion models ideal for producing varied samples. (Middle) In contrast, world model tasks require a balance between generation and prediction, and vanilla diffusion models demonstrate unsatisfactory sample consistency. (Right) Foresight Diffusion achieves a midpoint between highly stochastic generative models and fully deterministic models, making it well-suited for world modeling.
  • Figure 2: Kernel density estimation curves of sample-wise LPIPS distributions. Shaded areas represent estimated probability densities; dashed lines indicate sample means. Lower LPIPS scores correspond to better quality.
  • Figure 3: Comparison between diffusion-based world models and existing baselines. (a) Vanilla diffusion achieves competitive FVD with significantly fewer parameters, demonstrating high model efficiency. (b) Vanilla diffusion performs well on best and average LPIPS, but suffers from higher worst-case error, highlighting poor sample consistency. (c) Vanilla diffusion underperforms a deterministic predictor in the absence of noisy targets, revealing its limited predictive ability.
  • Figure 4: Overview of Foresight Diffusion. (a) Vanilla diffusion jointly processes the condition and the noisy target, limiting its predictive ability. (b) A deterministic model focuses solely on condition understanding and achieves better predictive performance. (c) ForeDiff-zero introduces a separate predictive stream to isolate condition understanding from noise. (d) ForeDiff further adopts a two-stage scheme: it pre-trains the predictive stream, then freezes its representations to guide generation.
  • Figure 5: Visualization of results on the RoboNet dataset (zoom in for details). In vanilla diffusion models, the pink shovel (left) appears distorted, while the toy object (right) collapses entirely. In contrast, ForeDiff produces more structurally plausible and visually coherent outputs.
  • ...and 2 more figures
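
The best, average, and worst per-sample errors discussed in the Figure 2 and Figure 3 captions can be summarized with a few lines of code. This is a minimal sketch: mean squared error stands in for LPIPS (an assumption made to keep the example dependency-free), and the function name `consistency_summary` is hypothetical.

```python
import numpy as np


def consistency_summary(samples, target):
    """Score each sample against the ground truth, then summarize the spread."""
    samples = np.asarray(samples, dtype=float)
    target = np.asarray(target, dtype=float)
    # One scalar error per sample: MSE over all non-sample axes (stand-in for LPIPS).
    errors = ((samples - target) ** 2).reshape(samples.shape[0], -1).mean(axis=1)
    return {
        "best": float(errors.min()),
        "mean": float(errors.mean()),
        "worst": float(errors.max()),
        "std": float(errors.std()),
    }


# Three hypothetical samples for one condition, compared with a zero-valued target.
target = np.zeros((2, 2))
samples = np.stack([np.zeros((2, 2)), np.ones((2, 2)), 2 * np.ones((2, 2))])
summary = consistency_summary(samples, target)
print(summary)
```

A small gap between `best` and `worst` (or a small `std`) indicates consistent samples; a model can have a good `best` score while still producing occasional poor samples.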

Theorems & Definitions (2)

  • Lemma 3.1
  • proof