Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving

Jianhua Han, Meng Tian, Jiangtong Zhu, Fan He, Huixin Zhang, Sitong Guo, Dechang Zhu, Hao Tang, Pei Xu, Yuze Guo, Minzhe Niu, Haojie Zhu, Qichao Dong, Xuechao Yan, Siyuan Dong, Lu Hou, Qingqiu Huang, Xiaosong Jia, Hang Xu

TL;DR

Percept-WAM tackles the fragility of spatial perception in autonomous driving by embedding explicit 2D/3D world states into a single vision–language model through World-PV and World-BEV tokens. It unifies perception and planning with grid-conditioned dense predictions, IoU-calibrated confidence, and parallel autoregressive trajectory decoding, while enabling streaming inference for real-time operation. The approach achieves strong results on 2D and BEV perception benchmarks and substantially improves end-to-end planning on nuScenes and NAVSIM, demonstrating robust open-vocabulary and long-tail generalization. Overall, Percept-WAM bridges perception and action within a unified VLM, offering a scalable path toward robust, end-to-end autonomous driving.

Abstract

Autonomous driving heavily relies on accurate and robust spatial perception. Many failures arise from inaccuracies and instability, especially in long-tail scenarios and complex interactions. However, current vision-language models are weak at spatial grounding and understanding, and VLA systems built on them therefore show limited perception and localization ability. To address these challenges, we introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that is the first to implicitly integrate 2D/3D scene understanding abilities within a single vision-language model (VLM). Instead of relying on QA-style spatial reasoning, Percept-WAM unifies 2D/3D perception tasks into World-PV and World-BEV tokens, which encode both spatial coordinates and confidence. We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Additionally, Percept-WAM leverages pretrained VLM parameters to retain general intelligence (e.g., logical reasoning) and can output perception results and trajectory control outputs directly. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on COCO 2D detection and nuScenes BEV 3D detection, respectively. When integrated with trajectory decoders, it further improves planning performance on nuScenes and NAVSIM, e.g., surpassing DiffusionDrive by 2.1 in PDMS on NAVSIM. Qualitative results further highlight its strong open-vocabulary and long-tail generalization.
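
The IoU-aware scoring mentioned in the abstract can be illustrated with a minimal sketch. It assumes axis-aligned 2D boxes in `(x1, y1, x2, y2)` format and takes the confidence target of a predicted box to be its best IoU against the ground-truth boxes. The function names and the quantization of the score into `num_bins` discrete token indices are hypothetical illustrations; this summary does not give the paper's exact formulation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def confidence_target(pred_box, gt_boxes):
    """Assumed target: best IoU of the prediction against any ground-truth box."""
    return max((box_iou(pred_box, g) for g in gt_boxes), default=0.0)


def confidence_token(score, num_bins=100):
    """Hypothetical quantization of a score in [0, 1] into a discrete bin index."""
    return min(int(score * num_bins), num_bins - 1)


if __name__ == "__main__":
    pred = (0.0, 0.0, 2.0, 2.0)
    gts = [(1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
    score = confidence_target(pred, gts)
    print(score, confidence_token(score))
```

Under this assumption, a well-localized prediction receives a target close to 1 and a false positive with no overlapping ground truth receives 0, which is the property that lets the confidence rank boxes by localization quality.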

Paper Structure

This paper contains 24 sections, 2 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: Comparison of VLA paradigms. (i) QA-style supervision frames spatial understanding as question answering (hwang2024emma), providing only indirect localization and yielding weak perception. (ii) Diffusion–decoder pipelines (zheng2025diffusion) offer generative control but lack LLM-level reasoning. (iii) Percept-WAM implicitly integrates 2D/3D scene understanding within a single VLM, enabling robust perception and trajectory prediction in complex scenarios.
  • Figure 2: The overall architecture of Percept-WAM. (i) We use a pretrained VLM backbone to maintain general reasoning capability. (ii) Percept-WAM unifies 2D and 3D perception via World-PV and World-BEV tokens. The learnable BEV-level grid tokens implicitly model the mapping from PV features to BEV-space representations. (iii) An Action Head is introduced to predict trajectories from world tokens via parallel decoding. An additional memory bank is introduced to support efficient streaming inference.
  • Figure 3: Illustration of the IoU-based confidence training strategy. (a) The confidence-tuning dataset is generated from model predictions on the GT dataset. This yields scores that better match real distributions and reduces false positives compared to training on a random perturbation dataset. (b) During training, the different dataset strategies are supervised through a loss-mask scheme that promotes precise learning of box and confidence tokens.
  • Figure 4: Illustration of grid query tokens in dense prediction. Note that the grid tokens are interpolated from World-PV or World-BEV tokens to predict the matched bounding box.
  • Figure 5: Trajectory decoding. Four sets of point-level queries interact with information from different input modalities and generate the trajectory using an MLP. $\mathbf{Q}_\text{ego}$, $\mathbf{Q}_\text{pv}$, and $\mathbf{Q}_\text{bev}$ are aligned with their corresponding modality tokens via attention masking, while $\mathbf{Q}_\text{full}$ accesses all features to decode the final trajectory.
  • ...and 10 more figures
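
The attention masking described in the Figure 5 caption can be sketched as a boolean query-to-token mask. This is an illustrative sketch only: it assumes the context tokens are laid out as `[ego | pv | bev]`, that each query group holds the same number of point-level queries, and that `True` means "may attend". The function name, the layout, and the omission of query-to-query attention are all assumptions, not details taken from the paper.

```python
def build_query_mask(n_ego, n_pv, n_bev, points_per_group):
    """Return (mask, order) for the four assumed query groups.

    mask[i][j] is True when query i may attend to context token j.
    order[i] names the group ("ego", "pv", "bev", "full") of query i.
    """
    total = n_ego + n_pv + n_bev
    # Half-open [start, end) token spans under the assumed [ego | pv | bev] layout.
    spans = {
        "ego": (0, n_ego),
        "pv": (n_ego, n_ego + n_pv),
        "bev": (n_ego + n_pv, total),
        "full": (0, total),  # Q_full sees every modality
    }
    mask, order = [], []
    for group in ("ego", "pv", "bev", "full"):
        start, end = spans[group]
        row = [start <= j < end for j in range(total)]
        for _ in range(points_per_group):
            mask.append(list(row))
            order.append(group)
    return mask, order


if __name__ == "__main__":
    mask, order = build_query_mask(n_ego=1, n_pv=2, n_bev=3, points_per_group=2)
    for name, row in zip(order, mask):
        print(f"{name:>4}", "".join("1" if ok else "0" for ok in row))
```

In a typical attention implementation, the positions marked `False` would be set to a large negative value before the softmax, so that the modality-specific queries read only their own tokens while the full queries aggregate across all of them.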