A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

Tommie Kerssies, Gabriele Berton, Ju He, Qihang Yu, Wufei Ma, Daan de Geus, Gijs Dubbelman, Liang-Chieh Chen

Abstract

Anticipating diverse future states is a central challenge in video world modeling. Discriminative world models produce a deterministic prediction that implicitly averages over possible futures, while existing generative world models remain computationally expensive. Recent work demonstrates that predicting the future in the feature space of a vision foundation model (VFM), rather than a latent space optimized for pixel reconstruction, requires significantly fewer world model parameters. However, most such approaches remain discriminative. In this work, we introduce DeltaTok, a tokenizer that encodes the VFM feature difference between consecutive frames into a single continuous "delta" token, and DeltaWorld, a generative world model operating on these tokens to efficiently generate diverse plausible futures. Delta tokens reduce video from a three-dimensional spatio-temporal representation to a one-dimensional temporal sequence, for example yielding a 1,024x token reduction with 512x512 frames. This compact representation enables tractable multi-hypothesis training, where many futures are generated in parallel and only the best is supervised. At inference, this leads to diverse predictions in a single forward pass. Experiments on dense forecasting tasks demonstrate that DeltaWorld forecasts futures that more closely align with real-world outcomes, while having over 35x fewer parameters and using 2,000x fewer FLOPs than existing generative world models. Code and weights: https://deltatok.github.io.
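For intuition on the quoted reduction: assuming the VFM tokenizes frames with a ViT patch size of 16 (an assumption; other patch sizes would give a different count), a $512\times512$ frame yields $(512/16)^2 = 32^2 = 1{,}024$ patch tokens, which DeltaTok replaces with a single delta token, hence the $1{,}024\times$ token reduction per frame.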


Figures (12)

  • Figure 1: Outline of DeltaWorld. Unlike large existing generative world models that require many forward passes and represent each frame with many spatial tokens, our small DeltaWorld generates multiple futures in a single forward pass by using a single delta token to encode the difference between consecutive frames.
  • Figure 2: Performance comparison. Compared to the generative world model Cosmos [agarwal2025cosmos], our DeltaWorld forecasts futures that better align with real-world outcomes while having over $35\times$ fewer parameters and using $2{,}000\times$ fewer FLOPs.
  • Figure 3: Overview of DeltaTok. Given two frames encoded by a frozen vision foundation model (VFM) into grids of patch tokens $x_{t-1}$ and $x_t$, the DeltaTok encoder takes both as input and compresses them into a single delta token $z_t$. The decoder reconstructs $\hat{x}_{t}$ from $x_{t-1}$ and $z_t$. Both encoder and decoder are Vision Transformers (ViT) [dosovitskiy2021image] trained with a Mean Squared Error (MSE) loss. (An interface sketch follows the figure list.)
  • Figure 4: Overview of DeltaWorld. The predictor operates entirely on delta tokens (Fig. 3) rather than spatial tokens, enabling efficient generation of future hypotheses. Best-of-Many training (top) backpropagates only through the best predicted delta token, so that diverse futures can be sampled in a single forward pass at inference (bottom). Shown with two context frames and two queries for illustration. (A loss sketch follows the figure list.)
  • Figure 5: Best-of-Many sample scaling. Effect of the number of training and evaluation queries on Cityscapes mid-horizon (${\sim}0.6$ s) mIoU. Using $256\times256$ crops.
  • ...and 7 more figures
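The following is a minimal sketch of the DeltaTok interface as described in the Figure 3 caption. It is not the authors' implementation: the module sizes, the use of standard transformer blocks as stand-ins for the ViT encoder/decoder, and the learnable-query readout are all assumptions for illustration.

```python
# Hypothetical sketch of the DeltaTok interface (Fig. 3), assuming PyTorch.
# Encoder: (x_{t-1}, x_t) patch tokens -> one delta token z_t.
# Decoder: (x_{t-1}, z_t) -> reconstruction x_hat_t. Trained with MSE.
import torch
import torch.nn as nn

class DeltaTok(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Stand-ins for the ViT encoder/decoder named in the caption.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        # Learnable query whose output is read out as the delta token
        # (an assumed mechanism; the paper may compress differently).
        self.delta_query = nn.Parameter(torch.zeros(1, 1, dim))

    def encode(self, x_prev: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # x_prev, x_t: (B, N, D) patch tokens from the frozen VFM.
        b = x_prev.size(0)
        tokens = torch.cat([self.delta_query.expand(b, -1, -1), x_prev, x_t], dim=1)
        return self.encoder(tokens)[:, :1]  # z_t: (B, 1, D)

    def decode(self, x_prev: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
        # Reconstruct x_hat_t from the previous frame's tokens and z_t.
        tokens = torch.cat([z_t, x_prev], dim=1)
        return self.decoder(tokens)[:, 1:]  # x_hat_t: (B, N, D)

def reconstruction_loss(x_hat: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
    # Per the caption, both encoder and decoder are trained with MSE.
    return ((x_hat - x_t) ** 2).mean()
```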
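Likewise, a sketch of the Best-of-Many objective from the Figure 4 caption: only the hypothesis closest to the ground-truth delta token receives gradient, a standard best-of-K formulation assumed here. Tensor shapes and names are illustrative.

```python
# Hypothetical Best-of-Many loss sketch (Fig. 4), assuming PyTorch.
import torch

def best_of_many_loss(pred_deltas: torch.Tensor,   # (B, K, D): K hypotheses
                      target_delta: torch.Tensor   # (B, D): ground-truth delta token
                      ) -> torch.Tensor:
    # Per-hypothesis MSE against the ground-truth delta token.
    errors = ((pred_deltas - target_delta.unsqueeze(1)) ** 2).mean(dim=-1)  # (B, K)
    # min() routes gradient only to the best hypothesis per sample;
    # the remaining K-1 hypotheses get no gradient and stay diverse.
    return errors.min(dim=1).values.mean()
```

Because the K hypotheses are produced in one batch dimension, training and sampling both complete in a single forward pass, which is consistent with the inference behavior described in the caption.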