DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang
TL;DR
This work tackles the challenge of unifying world modeling and trajectory planning for autonomous driving. It introduces DrivingGPT, a multimodal autoregressive transformer that tokenizes front-view images and relative actions into a single driving language and predicts the next token to perform both video generation and end-to-end planning. Key contributions include a multimodal tokenization scheme (VQ-VAE for images, percentile-based bins for actions), interleaved visual-action sequences, and frame-wise rotary embeddings, which together enable joint learning of world dynamics and planning. Empirical results on nuPlan and NAVSIM show competitive action-conditioned video generation and planning performance, outperforming diffusion-based baselines and simple planners. The work shows that a single autoregressive model can handle both world modeling and planning, and points to future directions for long-horizon, multimodal autonomous driving systems.
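As a rough illustration of the tokenization idea summarized above, the following sketch discretizes a continuous action component into percentile-based bins and interleaves per-frame image and action tokens into one flat sequence. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the nearest-rank choice of bin edges, and the convention of offsetting action ids past the image codebook are all hypothetical, and the image tokens are taken as given rather than produced by a VQ-VAE.

```python
from bisect import bisect_right


def percentile_edges(values, num_bins):
    """Return interior bin edges so each bin holds roughly equal numbers of samples.

    Assumption: nearest-rank quantiles without interpolation.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Edge k sits at the k/num_bins quantile of the observed values.
    return [ordered[min(n - 1, (k * n) // num_bins)] for k in range(1, num_bins)]


def quantize(value, edges):
    """Map a continuous value to a bin index in the range [0, len(edges)]."""
    return bisect_right(edges, value)


def interleave(image_tokens_per_frame, action_tokens_per_frame, image_vocab_size):
    """Build one flat sequence: frame-0 image tokens, frame-0 action tokens, frame-1 ...

    Assumption: action ids are shifted past the image codebook so the two
    vocabularies share one id space without colliding.
    """
    sequence = []
    for img, act in zip(image_tokens_per_frame, action_tokens_per_frame):
        sequence.extend(img)
        sequence.extend(image_vocab_size + a for a in act)
    return sequence


# Toy usage: 100 observed displacement values, 4 bins, 2 frames.
edges = percentile_edges(list(range(100)), num_bins=4)
action_ids = [[quantize(10.0, edges)], [quantize(80.0, edges)]]
tokens = interleave([[1, 2], [3, 4]], action_ids, image_vocab_size=10)
```

Because the edges follow the data's percentiles, frequently occurring small motions get finer resolution than rare large ones, which is the usual motivation for percentile binning over uniform binning.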
Abstract
World-model-based search and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities such as actions. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on the large-scale nuPlan and NAVSIM benchmarks.
