Horizon Imagination: Efficient On-Policy Training in Diffusion World Models

Lior Cohen; Ofir Nabati; Kaixin Wang; Navdeep Kumar; Shie Mannor

TL;DR

This paper tackles the computational bottlenecks of diffusion-based world models in reinforcement learning, focusing on control with discrete actions. It introduces Horizon Imagination (HI), an on-policy imagination framework that denoises multiple future observations in parallel. HI pairs this with a stabilization mechanism for discrete action sampling and a Horizon schedule that decouples the denoising budget from the horizon over which denoising is applied. HI enables training lightweight policies with reduced compute: on Atari 100K and Craftium, it maintains control performance with a sub-frame budget of half the denoising steps. Together, parallel denoising, stable action sampling, and the flexible schedule yield higher-quality generated trajectories across varied budgets, making diffusion world models more practical for deployable RL agents.

Abstract

We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is available at https://github.com/leor-c/horizon-imagination.
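To make the idea of decoupling the denoising budget from the horizon more concrete, the following is a minimal toy sketch, not the paper's Horizon schedule. It assumes noise levels in [0, 1] (1 is pure noise, 0 is clean) and a linear "wavefront" that sweeps across the frames. The names `toy_schedule`, `max_active_frames`, `budget`, and `nu` are hypothetical and chosen only for illustration. The point it shows is that the number of frames being partially denoised at once can stay bounded by a fixed `nu` while the number of denoising passes varies.

```python
import numpy as np


def toy_schedule(num_frames: int, budget: int, nu: int) -> np.ndarray:
    """Return a (budget + 1, num_frames) matrix of noise levels in [0, 1].

    Row k holds the noise level of every frame after k denoising passes.
    This is an illustrative construction, not the schedule from the paper.
    """
    # Wavefront position after each pass; it sweeps from 0 to num_frames + nu.
    front = np.linspace(0.0, num_frames + nu, budget + 1)[:, None]
    frames = np.arange(num_frames)[None, :]
    # Frames behind the wavefront are clean, frames far ahead are pure noise,
    # and frames within nu of the front are partially denoised.
    return np.clip((frames + nu - front) / nu, 0.0, 1.0)


def max_active_frames(schedule: np.ndarray) -> int:
    """Largest number of partially denoised frames in any single row."""
    active = (schedule > 0.0) & (schedule < 1.0)
    return int(active.sum(axis=1).max())


if __name__ == "__main__":
    for budget in (4, 8, 16):
        s = toy_schedule(num_frames=8, budget=budget, nu=3)
        # The window of partially denoised frames never exceeds nu,
        # whichever budget is used.
        print(budget, s.shape, max_active_frames(s))
```

In this toy, changing `budget` only changes how far the wavefront moves per pass, while `nu` alone sets how many frames are partially denoised at once.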

Paper Structure (47 sections, 1 theorem, 27 equations, 16 figures, 9 tables, 2 algorithms)

Introduction
Related Work
Large Diffusion World Models
Concurrent Multi-step Generation
Diffusion World Model Agents for Control
Preliminaries
Reinforcement Learning Setup
World Model Agents
Diffusion Framework
Method
World Model Training
Horizon Imagination (World Model Inference)
Minimizing Action Changes Over the Denoising Process
The Horizon Schedule
Actor-Critic Training
...and 32 more sections

Key Result

Proposition 1

The sampling scheme ${\mathbf{a}}(\cdot, \cdot)$ satisfies the following properties:

Figures (16)

Figure 1: Example of generation instabilities observed in Craftium/ChopTree-v0 under naive action sampling during horizon imagination. In contrast, our stable sampling method produces robust, high-quality generations. The first context frame is highlighted with a blue border.
Figure 2: Comparison of the Pyramidal schedule of Diffusion Forcing (Chen et al., 2024) and the proposed Horizon schedule (transposed). Horizon fixes the decay horizon ($\nu=3$), yielding consistent schedules across budgets. In the Pyramidal schedule, the decay horizon is entangled with the budget and drifts as the budget changes, which degrades generation quality at higher budgets.
Figure 3: Empirical study of the average number of action changes under various settings.
Figure 4: Actor-Critic Performance. Average episodic return curves of key baselines during training. Each baseline is evaluated over 5 seeds. Curves show the mean and standard deviation, smoothed by a moving average (window size $15$). A dashed horizontal line denotes Atari human-level performance.
Figure 5: Actor–critic performance comparison between the proposed stable action sampling method and the naive baseline. Each baseline is evaluated over 5 seeds. Curves show the mean and standard deviation, smoothed by a moving average (window size $15$). A dashed horizontal line denotes Atari human-level performance.
...and 11 more figures
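
Figures 1, 3, and 5 contrast naive action sampling with a stable alternative and count how often the sampled action changes. As a rough illustration of why the sampling procedure matters, the sketch below uses a generic common-random-numbers trick: inverse-CDF sampling with one uniform draw reused across steps. This is an assumption made for illustration and is not claimed to be the paper's sampling scheme; the names `sample_inverse_cdf` and `count_action_changes` are hypothetical. Each individual sample still follows its categorical distribution, but consecutive samples change only when the distribution shifts enough to move the shared draw into a different interval.

```python
import numpy as np


def sample_inverse_cdf(probs: np.ndarray, u: float) -> int:
    """Pick the category whose cumulative-probability interval contains u."""
    cdf = np.cumsum(probs)
    idx = int(np.searchsorted(cdf, u, side="right"))
    # Guard against floating-point round-off in the last cumulative value.
    return min(idx, len(probs) - 1)


def count_action_changes(prob_seq, rng, share_noise: bool) -> int:
    """Sample one action per distribution and count consecutive changes."""
    u = rng.random()
    actions = []
    for probs in prob_seq:
        if not share_noise:
            u = rng.random()  # naive: a fresh draw at every step
        actions.append(sample_inverse_cdf(probs, u))
    return sum(a != b for a, b in zip(actions, actions[1:]))


def drifting_distributions(steps: int = 10):
    """Hypothetical policy outputs that drift slowly between two endpoints."""
    start = np.array([0.7, 0.2, 0.1])
    end = np.array([0.1, 0.2, 0.7])
    return [(1 - t) * start + t * end for t in np.linspace(0.0, 1.0, steps)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = drifting_distributions()
    naive = sum(count_action_changes(seq, rng, False) for _ in range(200))
    shared = sum(count_action_changes(seq, rng, True) for _ in range(200))
    print("naive changes:", naive, "shared-noise changes:", shared)
```

When the distributions do not change at all, the shared-noise variant never switches actions, whereas independent draws still switch frequently.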

Theorems & Definitions (3)

Proposition 1
proof
proof
