DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

TL;DR

This work tackles the challenge of unifying world modeling and trajectory planning for autonomous driving. It introduces DrivingGPT, a multimodal autoregressive transformer that tokenizes front-view images and relative actions into a single driving language and predicts next tokens to perform both video generation and end-to-end planning. Key contributions include a multimodal tokenization scheme (VQ-VAE for images and percentile-based action bins), interleaved visual-action sequences, and frame-wise rotary embeddings enabling joint learning of world dynamics and planning. Empirical results on nuPlan and NAVSIM demonstrate competitive action-conditioned video generation and planning performance, outperforming diffusion-based baselines and simple planners. The work establishes the viability of differentiable, unified planning with a single model and points to future directions for long-horizon, multi-modal autonomous driving systems.

Abstract

World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.

Paper Structure

This paper contains 32 sections, 2 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Driving as next token prediction. Our DrivingGPT treats the interleaved discrete visual and action tokens of a driving sequence as a unified driving language and leverages multimodal autoregressive transformers to perform world modeling and end-to-end planning simultaneously, using standard next token prediction given historical driving tokens. In the planning view, the red rectangle denotes the ego car, the blue line is the generated trajectory, and the brown line is the human driving trajectory.
  • Figure 2: Detailed network architecture and data flow of DrivingGPT. Front camera driving images are tokenized by a VQ-VAE, and driving actions are tokenized via component-wise binning. Image tokens and action tokens are interleaved to form a driving language. A standard LLM architecture and a next token prediction training strategy are used. The predicted image tokens are grouped and decoded back into images by the VQ-VAE decoder, while the predicted action tokens are unbinned to recover the driving trajectory.
  • Figure 3: Comparison of long video generation. We showcase a 64-frame (32-second) sequence generated on the navtest dataset. (a) SVD fine-tuning methods often exhibit limitations in generating long videos, frequently repeating past content, such as indefinitely remaining at a red light. Conversely, (b) our DrivingGPT demonstrates superior performance in generating long, diverse, and visually appealing videos.
  • Figure 4: Object hallucination. Top: Diffusion-based methods often exhibit object hallucination phenomena. For instance, when comparing models fine-tuned with SVD, we observe the sudden appearance (red box) and gradual disappearance (green box) of objects. Bottom: In contrast, our autoregressive approach maintains better consistency.
  • Figure 5: DrivingGPT planning results in complex driving scenes: (a) Unprotected left turn; (b) Large curvature turn; (c) Merging into traffic; (d) Taking a better path than the human driver. The red rectangle denotes the ego car, the blue line is the generated trajectory, and the brown line is the human driving trajectory.
  • ...and 2 more figures
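
The tokenization flow summarized in the TL;DR and the Figure 2 caption (percentile-based, component-wise action binning, then interleaving image and action tokens) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `ActionTokenizer`, `interleave`, and `IMAGE_VOCAB`, the placement of action tokens after the image vocabulary, the image-then-action ordering within each frame, and the use of per-bin means for unbinning are all assumptions made for illustration. The image token ids stand in for the output of a VQ-VAE encoder, which is not modeled here.

```python
import bisect
import statistics


def fit_edges(samples, num_bins):
    """Interior bin edges at evenly spaced percentiles of the training samples."""
    return statistics.quantiles(samples, n=num_bins, method="inclusive")


def to_bin(value, edges):
    """Index of the bin containing value, in the range [0, len(edges)]."""
    return bisect.bisect_right(edges, value)


def fit_centers(samples, edges):
    """Mean training value per bin, used to map a bin index back to a number."""
    buckets = [[] for _ in range(len(edges) + 1)]
    for s in samples:
        buckets[to_bin(s, edges)].append(s)
    centers = []
    for i, bucket in enumerate(buckets):
        if bucket:
            centers.append(statistics.fmean(bucket))
        else:
            # Empty bin (possible with many repeated values): fall back to an edge.
            centers.append(edges[min(i, len(edges) - 1)])
    return centers


class ActionTokenizer:
    """Component-wise percentile binning (hypothetical helper, not from the paper)."""

    def __init__(self, actions, num_bins, image_vocab_size):
        # actions: list of equal-length tuples, e.g. (dx, dy, dtheta) per frame.
        self.num_bins = num_bins
        self.offset = image_vocab_size  # assumption: action ids follow image ids
        columns = list(zip(*actions))
        self.edges = [fit_edges(col, num_bins) for col in columns]
        self.centers = [fit_centers(col, e) for col, e in zip(columns, self.edges)]

    def encode(self, action):
        # Each component gets its own contiguous id range of size num_bins.
        return [
            self.offset + i * self.num_bins + to_bin(v, self.edges[i])
            for i, v in enumerate(action)
        ]

    def decode(self, tokens):
        # "Unbinning": replace each token by the mean value of its bin.
        return tuple(
            self.centers[i][t - self.offset - i * self.num_bins]
            for i, t in enumerate(tokens)
        )


def interleave(frames_image_tokens, actions, tokenizer):
    """Build one flat sequence: image tokens of frame t, then action tokens of frame t."""
    sequence = []
    for image_tokens, action in zip(frames_image_tokens, actions):
        sequence.extend(image_tokens)
        sequence.extend(tokenizer.encode(action))
    return sequence


IMAGE_VOCAB = 1024  # hypothetical codebook size
# Synthetic (dx, dy, dtheta) log used only to fit the bins.
log = [(0.1 * i, 0.01 * (i % 7), 0.001 * (i - 50)) for i in range(100)]
tok = ActionTokenizer(log, num_bins=8, image_vocab_size=IMAGE_VOCAB)
frames = [[5, 17, 900], [6, 17, 901]]  # stand-in VQ-VAE token ids, 3 per frame
seq = interleave(frames, log[:2], tok)
```

Percentile edges give each bin roughly the same number of training samples, so common small actions get finer resolution than rare large ones. A sequence built this way could be fed to any decoder-only transformer trained with next token prediction; predicted action tokens would then be passed through `decode` to recover an approximate trajectory.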