Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution

Shiyao Sang

TL;DR

The paper proposes the Tokenized Intent World Model (TIWM), a cognitively inspired planning framework that represents the scene with a minimal set of semantically rich tokens and co-evolves belief and intent for planning. TIWM extracts sparse tokens from BEV inputs, autoregressively predicts future intents, and conditions trajectory decoding on these imagined futures rather than on dense scene reconstruction. On nuPlan, TIWM achieves an ADE of 0.487 m without future prediction and improves to 0.382 m (a 21.6% gain) when future intents guide decoding, whereas adding an explicit reconstruction loss degrades performance, highlighting the value of task-driven semantic alignment and cognitive planning. The work argues for a paradigm shift toward planning through understanding, supported by the concepts of temporal fuzziness and cognitive consistency, which enable robust, long-horizon decision making and suggest applications beyond autonomous driving.

Abstract

We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Inspired by cognitive science, we propose that effective planning arises not from reconstructing the world, but from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Experiments on the nuPlan benchmark (720 scenarios, 11k+ samples) reveal three principles: (1) sparse intent tokens alone achieve 0.487 m ADE, demonstrating strong performance without future prediction; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.382 m, a 21.6% improvement, showing that performance emerges from cognitive planning; and (3) explicit reconstruction loss degrades performance, confirming that task-driven belief-intent co-evolution suffices under reliable perception inputs. Crucially, we observe the emergence of cognitive consistency: through prolonged training, the model spontaneously develops stable token dynamics that balance current perception (belief) and future goals (intent). This process, accompanied by "temporal fuzziness," enables robustness under uncertainty and continuous self-optimization. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent. By reframing planning as understanding rather than reaction, TIWM bridges the gap between world models and VLA systems, paving the way for foresightful agents that plan through imagination. Note: Numerical comparisons with methods reporting results on nuScenes are indicative only, as nuPlan presents a more challenging planning-focused evaluation.
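To make the reported numbers concrete, here is a minimal sketch of the metric and of the relative gain. It assumes the standard definition of ADE (mean Euclidean distance between predicted and ground-truth waypoints); the text above does not spell out the paper's exact evaluation protocol. The function names and the toy trajectory are illustrative and not taken from the paper.

```python
import math

def average_displacement_error(pred, target):
    """Mean L2 distance between matching waypoints (assumed ADE definition)."""
    if not pred or len(pred) != len(target):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline

# Toy 3-waypoint trajectory (hypothetical values, in metres).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
toy_ade = average_displacement_error(pred, target)  # (0 + 0.5 + 1.0) / 3 = 0.5

# ADE values reported in the abstract: 0.487 m -> 0.382 m.
gain = relative_improvement(0.487, 0.382)
print(f"toy ADE = {toy_ade:.3f} m, reported gain = {gain:.1%}")
```

The second computation reproduces the 21.6% figure: (0.487 - 0.382) / 0.487 ≈ 0.216.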

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Tokenized Intent World Model: from perception to cognitive world. At each timestep, sparse tokens are extracted from BEV perception features, representing the distilled semantics of the scene. The model then autoregressively predicts future intent tokens—compact, task-relevant abstractions of the agent’s imagined goals. The planning decoder jointly reasons over current sparse tokens and predicted future intents to generate executable trajectories. Each intent token serves three roles: (1) Representation — compressing the core cognitive state of the environment; (2) Intent — unfolding plausible futures through autoregressive prediction; (3) Decision — directly guiding the trajectory decoder toward goal-directed action. In this framework, the world becomes an actionable cognitive entity—interpreted, imagined, and utilized by the decision system—rather than a dense intermediate reconstruction.
  • Figure 2: Validation ADE (left) and training loss (right) versus epoch across four configurations. All runs converge stably, with late-epoch best ADEs between 0.382 and 0.492 m. Early-epoch behavior is configuration-dependent: with current tokens, intent conditioning yields modest early gains (lower mean ADE within the first 100 epochs), whereas with future tokens, the non-intent variant reaches lower ADE faster. The overall best minimum is obtained by the future-token model without intent.
  • Figure 3: Training dynamics for the "Future token without intent loss" configuration. Left: Validation ADE (m) converges to a minimum of 0.263 m at epoch 860. Right: Training loss decreases steadily with epoch, indicating continuous optimization. This sustained convergence demonstrates the model's capacity for prolonged learning and its potential for further performance gains.
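The data flow described in the Figure 1 caption (sparse tokens from BEV features, autoregressive prediction of future intent tokens, then trajectory decoding over both) can be sketched roughly as follows. This is a toy NumPy illustration of the three stages only, not the authors' architecture: the top-k norm selection rule, the single-matrix rollout, the mean pooling, the linear decoder, and all names and shapes are assumptions made for this example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_sparse_tokens(bev, k):
    """Keep the k BEV cells with the largest feature norm (hypothetical rule)."""
    c, h, w = bev.shape
    cells = bev.reshape(c, h * w).T            # (H*W, C): one feature vector per cell
    scores = np.linalg.norm(cells, axis=1)
    top = np.argsort(scores)[-k:]              # indices of the k highest-norm cells
    return cells[top]                          # (k, C)

def predict_future_intents(tokens, w_step, n_future):
    """Autoregressive rollout: each intent token is predicted from the previous one."""
    state = tokens.mean(axis=0)                # pooled summary of the current tokens, (C,)
    intents = []
    for _ in range(n_future):
        state = np.tanh(w_step @ state)        # next intent depends only on the last state
        intents.append(state)
    return np.stack(intents)                   # (n_future, C)

def decode_trajectory(tokens, intents, w_dec, horizon):
    """Decode (x, y) waypoints from current tokens plus predicted intents."""
    context = np.concatenate([tokens.mean(axis=0), intents.mean(axis=0)])  # (2C,)
    return (w_dec @ context).reshape(horizon, 2)

# Hypothetical sizes: 16 channels, 10x10 BEV grid, 8 tokens, 3 future steps, 6 waypoints.
bev = rng.normal(size=(16, 10, 10))
tokens = extract_sparse_tokens(bev, k=8)
w_step = rng.normal(size=(16, 16)) * 0.1       # random stand-in for a learned predictor
intents = predict_future_intents(tokens, w_step, n_future=3)
w_dec = rng.normal(size=(6 * 2, 32)) * 0.1     # random stand-in for a learned decoder
trajectory = decode_trajectory(tokens, intents, w_dec, horizon=6)
print(tokens.shape, intents.shape, trajectory.shape)  # (8, 16) (3, 16) (6, 2)
```

The point of the sketch is that the decoder never sees a reconstructed scene: its only inputs are the k current tokens and the rolled-out intent tokens.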