WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

Weilun Feng, Guoxin Fan, Haotong Qin, Chuanguang Yang, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Dingrui Wang, Longlong Liao, Michele Magno, Yongjun Xu

TL;DR

Experiments on diffusion world models show that WorldCache delivers up to 3.7× end-to-end speedup while maintaining 98% rollout quality, demonstrating its practicality in resource-constrained scenarios.

Abstract

Diffusion-based world models have shown strong potential for unified world simulation, but iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial variation, and \emph{non-uniform temporal dynamics}, where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbf{WorldCache}, a caching framework tailored to diffusion world models. We introduce \textit{Curvature-guided Heterogeneous Token Prediction}, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor to chaotic tokens with abrupt direction changes. We also design \textit{Chaotic-prioritized Adaptive Skipping}, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf{3.7$\times$} end-to-end speedup while maintaining \textbf{98\%} rollout quality, demonstrating the advantages and practicality of WorldCache in resource-constrained scenarios. Our code is available at https://github.com/FofGofx/WorldCache.
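
The heterogeneous prediction idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function names, the thresholds `tau_low` and `tau_high`, and the blend/damping parameters `beta` and `gamma` are hypothetical, and the geometric damping below is a simple stand-in, not the paper's Hermite-guided predictor. The curvature score `kappa` is taken as given, since its definition does not appear on this page.

```python
from typing import List, Sequence


def classify_token(kappa: float, tau_low: float, tau_high: float) -> str:
    """Assign a token to a group from its curvature score (thresholds are hypothetical)."""
    if kappa < tau_low:
        return "stable"
    if kappa < tau_high:
        return "linear"
    return "chaotic"


def predict_token(
    y_curr: Sequence[float],
    v_curr: Sequence[float],
    v_prev: Sequence[float],
    group: str,
    k: int,
    beta: float = 0.5,
    gamma: float = 0.7,
) -> List[float]:
    """Approximate a token's feature k cached steps after the last full computation.

    y_curr: feature from the last full computation
    v_curr: most recent per-step feature change (velocity)
    v_prev: the per-step change one step earlier
    """
    if group == "stable":
        # Reuse: the cached feature is returned unchanged.
        return list(y_curr)
    if group == "linear":
        # Linear extrapolation along the latest velocity.
        return [y + k * v for y, v in zip(y_curr, v_curr)]
    # Chaotic (assumed form): blend the two velocities and shrink each
    # successive step by gamma so the extrapolation cannot run away.
    damp = sum(gamma ** j for j in range(1, k + 1))
    return [
        y + damp * ((1.0 - beta) * vc + beta * vp)
        for y, vc, vp in zip(y_curr, v_curr, v_prev)
    ]
```

In this sketch, a chaotic token moves less per cached step than a linear one with the same velocity, which reflects the stated goal of stabilizing tokens whose direction changes abruptly.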

Paper Structure (66 sections, 2 theorems, 26 equations, 13 figures, 6 tables, 1 algorithm)

Key Result

Theorem 4.1

Let $\kappa_i$ be computed by Eq. (\ref{eq:wc_curvature_compact}). For any feature deviation $\Delta\mathbf{y}_{t,i}$ that shares the same modality/timestep scalar units as $\mathbf{y}_{t,i}$ (e.g., $\Delta\mathbf{y}_{t,i}=\tilde{\mathbf{y}}_{t,i}-\tilde{\mathbf{y}}_{t+1,i}$), the product $\kappa_i\cdot\|\Delta\mathbf{y}_{t,i}\|$ [...] where the residual $o(1)$ only arises from dimensionless numerical terms. Detailed proofs are in the Appendix.
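
To unpack what "dimensionless" means here, the following is a short scale-invariance check. It rests on an assumption that the theorem's title suggests but that no formula on this page confirms: that $\kappa_i$ carries inverse feature units, so rescaling the features by $c>0$ (written with primes below) sends $\kappa_i$ to $\kappa_i/c$.

```latex
\[
\kappa_i'\,\|\Delta\mathbf{y}_{t,i}'\|
= \frac{\kappa_i}{c}\,\|c\,\Delta\mathbf{y}_{t,i}\|
= \frac{\kappa_i}{c}\cdot c\,\|\Delta\mathbf{y}_{t,i}\|
= \kappa_i\,\|\Delta\mathbf{y}_{t,i}\|,
\qquad c>0 .
\]
```

Under that assumption the product does not depend on the units of the features, so drift values from different modalities or timesteps can be compared against one shared threshold.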

Figures (13)

  • Figure 1: WorldCache greatly accelerates two diffusion world models, HunyuanVoyager [huang2025voyager] and Aether [zhu2025aether], with up to 3.7$\times$ speedup while preserving high-fidelity details.
  • Figure 2: Overview of the proposed WorldCache framework. The pipeline alternates between FULL backbone evaluation and CACHE approximation. (Top) In each full computation step, tokens are partitioned into Stable, Linear, and Chaotic groups based on their curvature $\kappa$. (Bottom) During caching steps, heterogeneous predictors (Reuse, Linear Extrapolation, or Damped Update) are applied accordingly. (Left) The Chaotic-prioritized Adaptive Skipping (CAS) mechanism accumulates a curvature-normalized drift score $E_{acc}$ specifically from chaotic tokens, triggering a full computation only when critical drift is detected.
  • Figure 3: An illustration of token heterogeneity. (a) Modality and Spatial Variance: Distinct patterns between modalities and across spatial regions. (b) Trajectory Dynamics: Three trajectory trends: static, predictable, and sharp, non-linear direction shifts that defy simple extrapolation. More analysis in Appendix Sec. \ref{sec:more_analysis_token}.
  • Figure 4: Mechanism and effectiveness of the Damped Update. (a) Trajectory Illustration: The damped update stabilizes prediction using the historical velocity $\mathbf{v}_{t^\star-1}$. (b) Quantitative Error Analysis: The damped update reduces the cache error of chaotic tokens as the prediction window $k$ increases.
  • Figure 5: An illustration of non-uniform temporal dynamics. We plot the feature difference magnitude across denoising steps for different token percentiles ($p_{25}$ to $p_{100}$). The global drift is dominated by a small subset of "hard" tokens (top percentile, red line), while the majority remain stable. More analysis in Appendix Sec. \ref{sec:more_adaptive_skip}.
  • ...and 8 more figures
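
The Chaotic-prioritized Adaptive Skipping (CAS) mechanism described in the Figure 2 caption can be sketched as a small gate that decides between a FULL and a CACHE step. This is a rough illustration under stated assumptions, not the released implementation: the class name `ChaoticDriftGate`, the `threshold` parameter, and the choice to average $\kappa_i\cdot\|\Delta\mathbf{y}_{t,i}\|$ over chaotic tokens are all hypothetical.

```python
from math import sqrt
from typing import Sequence


class ChaoticDriftGate:
    """Accumulates curvature-normalized drift from chaotic tokens only."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold  # hypothetical trigger level for E_acc
        self.e_acc = 0.0            # accumulated drift score E_acc

    def should_recompute(
        self,
        kappas: Sequence[float],
        deltas: Sequence[Sequence[float]],
        chaotic: Sequence[bool],
    ) -> bool:
        """Return True when the next step should be a FULL backbone evaluation."""
        # Per-token drift kappa_i * ||delta_y_i||, restricted to chaotic tokens.
        drifts = [
            k * sqrt(sum(d * d for d in dy))
            for k, dy, is_chaotic in zip(kappas, deltas, chaotic)
            if is_chaotic
        ]
        if drifts:
            # Assumed aggregation: mean over the chaotic subset.
            self.e_acc += sum(drifts) / len(drifts)
        if self.e_acc >= self.threshold:
            self.e_acc = 0.0  # reset after triggering a full computation
            return True
        return False
```

Calling `should_recompute` once per denoising step yields a run of CACHE steps that ends as soon as the accumulated drift of the chaotic tokens crosses the threshold. Stable and linear tokens never contribute to the decision.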

Theorems & Definitions (3)

  • Theorem 4.1: Curvature-induced dimensionless normalization
  • Theorem 1.1: Curvature-induced dimensionless normalization
  • proof