Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution
Shiyao Sang
TL;DR
The paper proposes the Tokenized Intent World Model (TIWM), a cognitively inspired planning framework that represents the scene with a minimal set of semantically rich tokens and co-evolves belief and intent over them. TIWM extracts sparse tokens from bird's-eye-view (BEV) inputs, autoregressively predicts future intent tokens, and conditions trajectory decoding on these imagined futures rather than on a dense scene reconstruction. On nuPlan, TIWM achieves an average displacement error (ADE) of 0.487 m without future prediction, which improves to 0.382 m (a 21.6% reduction) when predicted future intents guide decoding. Adding an explicit reconstruction loss degrades performance, which the paper takes as evidence that task-driven semantic alignment matters more than reconstruction fidelity. The work frames this as a shift toward planning through understanding, supported by two observed phenomena, temporal fuzziness and cognitive consistency, which the paper argues enable robust long-horizon decision making and may extend to applications beyond autonomous driving.
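As a quick check of the reported gain, the relative ADE reduction can be worked out from the two figures quoted above. This is only arithmetic on the stated numbers, not an additional result:

```latex
\[
\frac{0.487 - 0.382}{0.487} = \frac{0.105}{0.487} \approx 0.216 \quad (21.6\%)
\]
```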
Abstract
We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Inspired by cognitive science, we propose that effective planning arises not from reconstructing the world, but from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Experiments on the nuPlan benchmark (720 scenarios, 11k+ samples) reveal three principles: (1) sparse intent tokens alone achieve an average displacement error (ADE) of 0.487 m, demonstrating strong performance without future prediction; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.382 m, a 21.6% improvement, showing that performance emerges from cognitive planning; and (3) adding an explicit reconstruction loss degrades performance, confirming that task-driven belief-intent co-evolution suffices under reliable perception inputs. Crucially, we observe the emergence of cognitive consistency: through prolonged training, the model spontaneously develops stable token dynamics that balance current perception (belief) and future goals (intent). This process, accompanied by "temporal fuzziness," enables robustness under uncertainty and continuous self-optimization. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent. By reframing planning as understanding rather than reaction, TIWM bridges the gap between world models and vision-language-action (VLA) systems, paving the way for foresightful agents that plan through imagination. Note: Numerical comparisons with methods reporting results on nuScenes are indicative only, as nuPlan presents a more challenging planning-focused evaluation.
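For readers unfamiliar with the metric, the following is a minimal sketch of how ADE is commonly computed: the mean Euclidean distance between predicted and ground-truth waypoints over the planning horizon. The function name, the 2D waypoint format, and the toy trajectories are illustrative assumptions; the abstract does not specify the evaluation horizon or sampling rate, so this is not the paper's evaluation code.

```python
import math

def average_displacement_error(pred, gt):
    """Mean L2 distance between matching (x, y) waypoints, in the input units.

    Hypothetical helper for illustration; `pred` and `gt` are equal-length
    sequences of (x, y) pairs sampled at the same timestamps.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred, gt)
    )
    return total / len(pred)

# Toy data (not from the paper): the prediction drifts laterally
# by 0.3 m, 0.4 m, and 0.5 m at three successive waypoints.
gt_traj = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred_traj = [(1.0, 0.3), (2.0, 0.4), (3.0, 0.5)]

toy_ade = average_displacement_error(pred_traj, gt_traj)
print(f"ADE = {toy_ade:.3f} m")
```

Because ADE averages over all waypoints, a small constant offset and a late large deviation can produce the same score, which is one reason such numbers are best compared only within a single benchmark and protocol.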
