
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

Abstract

Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
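The branch-and-score training protocol described above (roll out a shared prefix on the model's own predictions, branch K candidate continuations, score each against ground truth) can be sketched in miniature. Everything here is a toy stand-in, not the paper's implementation: `world_model_step` fakes one autoregressive clip prediction with scalar "frames", and `clip_reward` is a placeholder for the multi-view perceptual reward (the paper combines metrics such as LPIPS and SSIM across camera views).

```python
import random

# Hypothetical stand-ins for the paper's components, using scalars as "frames".

def world_model_step(history, action, noise=0.0):
    # Toy dynamics: the next frame drifts from the last one by the action
    # plus model noise (the kind of error that compounds over a rollout).
    return history[-1] + action + noise

def clip_reward(candidate, ground_truth):
    # Higher reward for candidates closer to ground truth (negated error).
    return -abs(candidate - ground_truth)

def branch_and_score(init_frame, actions, ground_truth, k=4, seed=0):
    """S1: roll a shared prefix autoregressively on the model's own outputs.
    S2: branch K independent candidate continuations from the frozen prefix.
    S3: score each candidate against ground truth with a clip-level reward."""
    rng = random.Random(seed)
    history = [init_frame]
    for a in actions[:-1]:                      # shared autoregressive prefix
        history.append(world_model_step(history, a))
    candidates, rewards = [], []
    for _ in range(k):                          # K independent continuations
        cand = world_model_step(history, actions[-1], noise=rng.gauss(0, 0.1))
        candidates.append(cand)
        rewards.append(clip_reward(cand, ground_truth))
    # S4: in the real method, rewards weight the positive/negative x0
    # predictions in the contrastive model update.
    best = max(range(k), key=rewards.__getitem__)
    return candidates, rewards, best

cands, rs, best = branch_and_score(0.0, [0.1, 0.1, 0.1], ground_truth=0.3)
```

The key design point mirrored here is that the prefix is generated by the model itself rather than taken from ground-truth history, so the candidates are scored in the same error-accumulating regime the model faces at deployment.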

Paper Structure

This paper contains 28 sections, 2 theorems, 37 equations, 11 figures, 12 tables, 1 algorithm.

Key Result

Lemma 1

The posterior $\pi^{\mathrm{old}}(\mathbf{x}_0|\mathbf{x}_\sigma,\mathbf{c})$ is a mixture of the positive and negative posteriors,
$$\pi^{\mathrm{old}}(\mathbf{x}_0|\mathbf{x}_\sigma,\mathbf{c}) = \alpha\,\pi^{+}(\mathbf{x}_0|\mathbf{x}_\sigma,\mathbf{c}) + (1-\alpha)\,\pi^{-}(\mathbf{x}_0|\mathbf{x}_\sigma,\mathbf{c}),$$
where $\alpha \equiv \alpha(\mathbf{x}_\sigma,\mathbf{c})$ is a data-dependent mixing weight.

Figures (11)

  • Figure 1: Autoregressive video rollout quality. We use an action-conditioned robot world model to generate multi-view image predictions from a single observed state. The ground-truth (blue) is compared against the baseline (red) and our post-trained model PersistWorld (green). While the baseline accumulates error and destroys the object (cyan bowl) within seconds, our method maintains structural integrity and spatial consistency, establishing a new state-of-the-art in rollout fidelity.
  • Figure 2: Overview of our method. (Top) Autoregressive inference: A robot policy generates actions fed to the world model, which produces multi-view frames that are appended to the history buffer and condition the next generation step. (Bottom) RL post-training: (S1) A shared variable-length prefix is rolled out autoregressively from a ground-truth initial condition. (S2) $K$ independent candidate continuations are branched from the frozen prefix state. (S3) Candidates are scored against ground-truth using multi-view perceptual rewards. (S4) Reward weights $r$ scale implicit positive/negative $\mathbf{x}_0$ predictions used in contrastive model updates via loss $L$.
  • Figure 3: Qualitative comparison of autoregressive rollout stability. We compare long-horizon (11 s) generations from the baseline CtrlWorld against our PersistWorld for the wrist camera. Left: Object-centric fidelity. The baseline model suffers from rapid decoherence; as errors compound in the history buffer, manipulated objects like the cup lose their structural identity and dissolve into amorphous textures. In contrast, our method maintains the spatial consistency and structural integrity of the object throughout the rollout. Right: Robot-centric consistency. The baseline exhibits significant robot decoherence, where the generated robot arm loses its geometric structure. Our approach maintains structural persistence. Please see additional video results on the associated project page.
  • Figure 4: $\Delta_{\text{metric}}$ of paired videos from the validation dataset. In one-to-one paired comparisons, our PersistWorld world model is better than the baseline on $\sim98\%$ of the samples ($p < 10^{-6}$).
  • Figure 5: Temporal evolution of wrist camera metrics. While both models exhibit natural degradation over longer horizons (x-axis), our post-trained model, PersistWorld (green), consistently maintains higher fidelity and slower error accumulation compared to the baseline (orange). Specifically, our method preserves a higher PSNR and SSIM while suppressing LPIPS drift, effectively extending the stable prediction horizon for complex, fine-grained interactions. See the corresponding figure in the Appendix for external camera results.
  • ...and 6 more figures
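The paired-comparison statistic reported for Figure 4 (a per-video win rate with a significance level) can be computed as follows. This is a sketch under assumptions, not the paper's evaluation code: `ours` and `baseline` are assumed per-video quality scores (higher is better) for the same validation videos, and significance is assessed with an exact two-sided sign test over the per-pair differences, dropping ties.

```python
from math import comb

def paired_win_rate(ours, baseline):
    """Fraction of paired videos where our score beats the baseline,
    plus an exact two-sided sign-test p-value (ties excluded)."""
    wins = sum(o > b for o, b in zip(ours, baseline))
    ties = sum(o == b for o, b in zip(ours, baseline))
    n = len(ours) - ties                 # effective sample size
    k = max(wins, n - wins)              # more extreme side of the split
    # Exact binomial tail under the null hypothesis p = 0.5, doubled
    # for a two-sided test (the distribution is symmetric).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins / n, min(1.0, 2 * tail)
```

With ~98% wins over a validation set of the size implied by $p < 10^{-6}$, the null hypothesis that either model is equally likely to win each pair is decisively rejected.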

Theorems & Definitions (4)

  • Lemma 1: Posterior mixture
  • Proof
  • Theorem 1: Optimal predictor
  • Proof