Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model

Dongwon Kim; Gawon Seo; Jinsung Lee; Minsu Cho; Suha Kwak

TL;DR

The paper proposes CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens. This drastically reduces computational cost while preserving the information essential for planning, offering a practical step toward real-world deployment of world models.

Abstract

World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that employs the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
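To make the token-count bottleneck concrete, the sketch below counts query-key pairs in one full self-attention layer, which grows quadratically with sequence length. The figures used here (256 tokens per frame for a conventional tokenizer, a context of 4 frames) and the function name are illustrative assumptions, not values taken from the paper; only the 8-token setting comes from the abstract.

```python
def attention_pair_count(tokens_per_frame: int, context_frames: int) -> int:
    """Query-key pairs in one full self-attention layer over the whole context."""
    seq_len = tokens_per_frame * context_frames
    return seq_len * seq_len


# Assumed, illustrative settings (not reported in the source).
CONTEXT_FRAMES = 4
CONVENTIONAL_TOKENS = 256  # stand-in for "hundreds of tokens" per observation
COMPACT_TOKENS = 8         # the CompACT setting named in the abstract

conventional_pairs = attention_pair_count(CONVENTIONAL_TOKENS, CONTEXT_FRAMES)
compact_pairs = attention_pair_count(COMPACT_TOKENS, CONTEXT_FRAMES)
ratio = conventional_pairs // compact_pairs

print(f"conventional: {conventional_pairs} pairs, compact: {compact_pairs} pairs")
print(f"reduction factor per attention layer: {ratio}x")
```

Under these assumptions the per-layer attention cost shrinks by the square of the token-count ratio. A planner that evaluates many candidate rollouts pays that cost once per candidate and per step, which is where the savings would accumulate.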

Paper Structure (31 sections, 1 theorem, 8 equations, 16 figures, 12 tables)

Introduction
Related Work
Image tokenization
Masked generative model
Planning via World Models
Method
Latent generative model as world model
CompACT tokenizer
Semantic encoding via frozen features
Generative decoding
World model in CompACT latent space
Experiment
Experimental Settings
Tokenizer evaluation and ablations
Characterizing CompACT latent tokens
...and 16 more sections

Key Result

Proposition 1

If the optimal planning algorithm $\pi$ is deterministic, i.e., $H({\bm{a}}^{*}\mid{\bm{o}})=0$, then a planning-sufficient representation ${\bm{z}}$ (Def. 1) exists with minimum entropy established by necessity (no planning-sufficient ${\bm{z}}$ can have lower entropy) and achievability (a ${\bm{z}}$ attaining this bound exists).

Figures (16)

Figure 1: Overview of the proposed latent world model formulation (Sec. \ref{sec:formulation}). (a) An image tokenizer is first trained with a reconstruction objective to map an input image into compact latent tokens ${\bm{z}}$ (Fig. \ref{fig:tok_detail} and Sec. \ref{sec:compacttok}). (b) Using the learned tokenizer, a latent world model $f_{\phi}({\bm{z}}_t, {\bm{a}}_t)$ is trained to model the conditional distribution of the future state $p_{\phi}({\bm{z}}_{t+1} | {\bm{z}}_{t}, {\bm{a}}_{t})$, where we adopt masked generative modeling (Sec. \ref{sec:compact_lwm}). (c) At test time, the learned latent world model is used for decision-time planning: an optimization procedure (e.g., MPC with CEM) searches over actions ${\bm{a}}_{0:H-1}$ to minimize the distance between the predicted final state and a goal image.
Figure 2: Tokenizer architecture in detail. During training, only the latent resampler and $\mathcal{D}_\textrm{compact}$ are updated. $\mathcal{E}_{\psi}$ produces masked target tokens (training only), while $\mathcal{D}_{\psi}$ is used only during inference for pixel-level reconstruction.
Figure 3: Architecture variants of the CompACT encoder $\mathcal{E}_\textrm{compact}$. (a) ViT (scratch) + [REG]: Initial latent tokens are concatenated to the input patch tokens. This design follows previous transformer-based image tokenizers [yu2024image, bachmann2025flextok, yu2021vector]. (b) DINOv3 [simeoni2025dinov3] + [REG]: Similar to (a), but the encoder is initialized with DINOv3. (c) DINOv3 [simeoni2025dinov3] + latent resampler: a latent resampler on top of a DINOv3-initialized encoder. DINOv3 and ViT are updated during training in these variants.
Figure 4: Attention visualization for the compact latent tokens in the latent resampler. Brighter colors indicate higher attention scores.
Figure 5: Qualitative results of planning with the proposed CompACT. Best and Worst 1&2 denote the final rollouts corresponding to the simulated trajectories with the minimum and maximum cost, respectively.
...and 11 more figures
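
The decision-time planning loop described in Figure 1(c) can be sketched as follows. This is a minimal cross-entropy method (CEM) illustration under stated assumptions, not the paper's implementation: `toy_world_model` is a hypothetical stand-in for $f_{\phi}$ in which the latent state is a 2-D point shifted by the action, and all function names, hyperparameters, and the squared-distance cost are chosen for illustration only.

```python
import random
import statistics


def toy_world_model(z, a):
    """Hypothetical stand-in for f_phi(z_t, a_t): shift a 2-D latent by the action."""
    return (z[0] + a[0], z[1] + a[1])


def rollout_cost(z0, actions, goal):
    """Roll the model forward and score the final state by squared distance to the goal."""
    z = z0
    for a in actions:
        z = toy_world_model(z, a)
    return (z[0] - goal[0]) ** 2 + (z[1] - goal[1]) ** 2


def cem_plan(z0, goal, horizon=4, samples=64, elites=8, iterations=10, seed=0):
    """Search over action sequences a_{0:H-1} with CEM; returns the final mean plan."""
    rng = random.Random(seed)
    dims = horizon * 2  # two action dimensions per step (an assumption of this toy)
    mean = [0.0] * dims
    std = [1.0] * dims
    for _ in range(iterations):
        scored = []
        for _ in range(samples):
            # Sample one flattened action sequence from the current Gaussian.
            flat = [rng.gauss(m, s) for m, s in zip(mean, std)]
            actions = [(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)]
            scored.append((rollout_cost(z0, actions, goal), flat))
        # Keep the lowest-cost sequences and refit the sampling distribution to them.
        scored.sort(key=lambda item: item[0])
        elite = [flat for _, flat in scored[:elites]]
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elite)]
    return [(mean[2 * t], mean[2 * t + 1]) for t in range(horizon)]


start, goal = (0.0, 0.0), (2.0, -1.0)
plan = cem_plan(start, goal)
initial_cost = rollout_cost(start, [(0.0, 0.0)] * len(plan), goal)
final_cost = rollout_cost(start, plan, goal)
print(f"cost with no action: {initial_cost:.3f}, cost after CEM: {final_cost:.3f}")
```

In an MPC setting only the first action of the returned plan would be executed before replanning from the next observation. Each CEM iteration calls the world model `samples * horizon` times, so the per-call cost of the latent model sets the planning latency.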

Theorems & Definitions (3)

Definition 1: Planning sufficiency
Proposition 1: Minimum description length for planning
proof
