Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak

TL;DR

CompACT is a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving the information essential for planning. It offers a practical step toward real-world deployment of world models.

Abstract

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that employs the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
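
To make the token-count bottleneck concrete, the following minimal sketch compares the number of pairwise token interactions in self-attention for two tokenizers. It assumes a transformer-style world model whose attention cost grows quadratically with sequence length; the value 256 (a stand-in for "hundreds of tokens"), the 4-frame context, and the helper `attention_pairs` are illustrative assumptions, not figures or code from the paper.

```python
def attention_pairs(tokens_per_frame: int, context_frames: int) -> int:
    """Pairwise token interactions in one self-attention layer (hypothetical helper)."""
    sequence_length = tokens_per_frame * context_frames
    return sequence_length * sequence_length


# 8 tokens per observation, as in the abstract; 4 context frames is an assumption.
compact = attention_pairs(tokens_per_frame=8, context_frames=4)

# 256 tokens per observation is an assumed stand-in for "hundreds of tokens".
baseline = attention_pairs(tokens_per_frame=256, context_frames=4)

ratio = baseline // compact
print(f"baseline/compact attention pairs: {ratio}x")
```

This counts only the attention term of a single forward pass. It is not a measurement of the planning speedup reported by the authors, which also depends on the planner and the rest of the model.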

Paper Structure (31 sections, 1 theorem, 8 equations, 16 figures, 12 tables)

Key Result

Proposition 1

If the optimal planning algorithm $\pi$ is deterministic, i.e., $H({\bm{a}}^{*}\mid{\bm{o}})=0$, then there exists a planning-sufficient representation ${\bm{z}}$ (Definition 1) with minimum entropy. The minimum is established by necessity (no planning-sufficient ${\bm{z}}$ can have lower entropy) and achievability (a ${\bm{z}}$ attaining this bound exists).
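
This excerpt gives neither Definition 1 nor the value of the bound, so the following is only a sketch of how a necessity/achievability argument of this kind typically runs. It assumes that ${\bm{z}}$ is a discrete function of ${\bm{o}}$, that planning sufficiency means $I({\bm{z}};{\bm{a}}^{*}) = I({\bm{o}};{\bm{a}}^{*})$, and that the bound is $H({\bm{a}}^{*})$; all three are assumptions rather than statements taken from the paper.

```latex
% Assumptions (not from the excerpt): z is discrete, z = g(o), and
% planning sufficiency means I(z; a^*) = I(o; a^*).
\begin{align*}
I({\bm{o}};{\bm{a}}^{*})
  &= H({\bm{a}}^{*}) - H({\bm{a}}^{*}\mid{\bm{o}})
   = H({\bm{a}}^{*})
  && \text{(deterministic planner)} \\
H({\bm{z}})
  &\ge I({\bm{z}};{\bm{a}}^{*})
   = I({\bm{o}};{\bm{a}}^{*})
   = H({\bm{a}}^{*})
  && \text{(necessity)} \\
{\bm{z}} = \pi({\bm{o}}) = {\bm{a}}^{*}
  &\;\Longrightarrow\; H({\bm{z}}) = H({\bm{a}}^{*})
  && \text{(achievability)}
\end{align*}
```

Under these assumptions, the representation needs to carry no more information than the optimal action itself, which is consistent with the motivation for a very small token budget.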

Figures (16)

  • Figure 1: Overview of the proposed latent world model formulation (Sec. \ref{sec:formulation}). (a) An image tokenizer is first trained with a reconstruction objective to map an input image into compact latent tokens ${\bm{z}}$ (Fig. \ref{fig:tok_detail} and Sec. \ref{sec:compacttok}). (b) Using the learned tokenizer, a latent world model $f_{\phi}({\bm{z}}_t, {\bm{a}}_t)$ is trained to model the conditional distribution of the future state $p_{\phi}({\bm{z}}_{t+1} | {\bm{z}}_{t}, {\bm{a}}_{t})$, where we adopt masked generative modeling (Sec. \ref{sec:compact_lwm}). (c) At test time, the learned latent world model is used for decision-time planning: an optimization procedure (e.g., MPC with CEM) searches over actions ${\bm{a}}_{0:H-1}$ to minimize the distance between the predicted final state and a goal image.
  • Figure 2: Tokenizer architecture details. During training, only the latent resampler and $\mathcal{D}_\textrm{compact}$ are updated. $\mathcal{E}_{\psi}$ produces masked target tokens (training only), while $\mathcal{D}_{\psi}$ is used only during inference for pixel-level reconstruction.
  • Figure 3: Architecture variations of the CompACT encoder $\mathcal{E}_\textrm{compact}$. (a) ViT (scratch) + [REG]: Initial latent tokens are concatenated to the input patch tokens. This design follows previous transformer-based image tokenizers [yu2024image, bachmann2025flextok, yu2021vector]. (b) DINOv3 [simeoni2025dinov3] + [REG]: Similar to (a), but the encoder is initialized with DINOv3. (c) DINOv3 [simeoni2025dinov3] + latent resampler: A latent resampler combined with a DINOv3-initialized encoder. The DINOv3 and ViT encoders are updated during training in these variants.
  • Figure 4: Attention visualization for a compact latent token in the latent resampler. Brighter colors indicate higher attention scores.
  • Figure 5: Qualitative results of planning with the proposed CompACT. Best and Worst 1&2 denote the final rollouts corresponding to the simulated trajectories with the minimum and maximum cost, respectively.
  • ...and 11 more figures
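
The decision-time planning loop named in the Figure 1 caption (MPC with CEM) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_world_model` is a hypothetical stand-in for the learned $f_{\phi}$ in which the latent state is a 2-D point and an action is a displacement, and the population size, elite count, iteration count, and action bounds are all assumed values.

```python
import numpy as np


def toy_world_model(z, a):
    # Hypothetical stand-in for the learned latent world model f_phi(z_t, a_t):
    # the "latent" is a 2-D point and the action simply displaces it.
    return z + a


def cem_plan(z0, z_goal, horizon=5, pop=256, n_elite=32, iters=20, seed=0):
    """Cross-entropy method over action sequences a_{0:H-1} (illustrative settings)."""
    rng = np.random.default_rng(seed)
    act_dim = z0.shape[0]
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # Sample candidate action sequences and clip them to assumed bounds.
        actions = mean + std * rng.standard_normal((pop, horizon, act_dim))
        actions = np.clip(actions, -1.0, 1.0)
        # Roll every candidate forward through the world model.
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            z = toy_world_model(z, actions[:, t])
        # Cost: distance between the predicted final state and the goal.
        cost = np.linalg.norm(z - z_goal, axis=1)
        # Refit the sampling distribution to the lowest-cost candidates.
        elite = actions[np.argsort(cost)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


z0 = np.array([0.0, 0.0])
z_goal = np.array([2.0, -1.5])
plan = cem_plan(z0, z_goal)
final = z0 + plan.sum(axis=0)  # final state reached by the planned actions
print("distance to goal:", np.linalg.norm(final - z_goal))
```

In an MPC loop, only the first action of the returned plan would be executed before replanning from the new observation. Every CEM iteration requires `pop * horizon` world-model calls, which is why the per-call cost of the world model dominates planning time.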

Theorems & Definitions (3)

  • Definition 1: Planning sufficiency
  • Proposition 1: Minimum description length for planning
  • proof