GPD-1: Generative Pre-training for Driving

Zixun Xie, Sicheng Zuo, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jie Zhou, Jiwen Lu, Shanghang Zhang

TL;DR

GPD-1 is a unified generative pre-training framework for driving. It encodes each scene as tokens for the ego vehicle, the surrounding agents, and the map, and uses a GPT-like transformer with a scene-level mask to forecast how the scene evolves. A 2D BEV map tokenizer based on VQ-VAE and a hierarchical agent tokenizer produce the discrete tokens, enabling efficient multi-frame prediction without task-specific fine-tuning. The model is pre-trained on nuPlan with a two-stage training regime and demonstrates zero-shot and few-shot generalization to scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning, with competitive results and robust behavior in complex scenarios. The result is an interpretable, token-based, data-driven framework that integrates scene understanding and planning, with practical uses in realistic simulation and downstream decision-making.
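To make the idea of a VQ-VAE-style map tokenizer concrete, the following is a minimal sketch of the quantization step only: each continuous feature vector is replaced by the index of its nearest codebook entry. The function names, the toy codebook, and the 2-dimensional features are illustrative assumptions, not taken from the GPD-1 code; the encoder and decoder networks are omitted.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def tokenize_map_features(features, codebook):
    """Map a list of continuous feature vectors to discrete token ids."""
    return [quantize(f, codebook) for f in features]


# Hypothetical 3-entry codebook over 2-D features (real codebooks are learned
# and far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9]]
tokens = tokenize_map_features(features, codebook)
print(tokens)
```

In a trained model the codebook is learned jointly with the encoder and decoder; here it is fixed by hand purely to show the lookup.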

Abstract

Modeling the evolution of driving scenarios is important for the evaluation and decision-making of autonomous driving systems. Most existing methods focus on a single aspect of scene evolution, such as map generation, motion prediction, or trajectory planning. In this paper, we propose a unified Generative Pre-training for Driving (GPD-1) model that accomplishes all of these tasks without additional fine-tuning. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. We adopt the autoregressive transformer architecture and use a scene-level attention mask to enable intra-scene bi-directional interactions. For the ego and agent tokens, we propose a hierarchical positional tokenizer to effectively encode both 2D positions and headings. For the map tokens, we train a map vector-quantized autoencoder to efficiently compress ego-centric semantic maps into discrete tokens. We pre-train GPD-1 on the large-scale nuPlan dataset and conduct extensive experiments to evaluate its effectiveness. With different prompts, GPD-1 generalizes to various tasks without fine-tuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning. Code: https://github.com/wzzheng/GPD.
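One plausible reading of the scene-level attention mask is a block-causal mask: tokens attend freely to every token in their own scene (bi-directional within a scene) and to all tokens of earlier scenes, but never to later ones. The sketch below builds such a mask under that assumption. The function name, the fixed `tokens_per_scene`, and the cross-scene causal rule are assumptions for illustration; the abstract only states that the mask enables intra-scene bi-directional interactions.

```python
def scene_level_mask(num_scenes, tokens_per_scene):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Assumption: every scene contributes the same number of tokens, laid out
    contiguously in time order.
    """
    total = num_scenes * tokens_per_scene
    return [
        # Key k is visible when its scene index is not later than the query's.
        [(k // tokens_per_scene) <= (q // tokens_per_scene) for k in range(total)]
        for q in range(total)
    ]


mask = scene_level_mask(num_scenes=2, tokens_per_scene=2)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

Compared with a token-level causal mask, the only difference is that the visibility test is applied to scene indices rather than token indices, which is what opens up attention among tokens of the same frame.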

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Given past 2D BEV observations, our pre-trained GPD-1 model can jointly predict future scene evolution and agent movements. This task requires both spatial understanding of the 2D scene and temporal modeling of how driving scenarios progress. We observe that GPD-1 successfully forecasts the movements of surrounding agents and future map elements. Remarkably, it even generates more plausible drivable areas than the ground truth, showcasing its capacity to understand the scene rather than merely memorizing training data. However, it struggles to anticipate new vehicles entering the field of view, which is challenging due to their absence in the input data.
  • Figure 2: Illustration of the agent tokenizer. We use a set of thresholds to categorize agent states into different ranges to convert continuous information into discrete representations.
  • Figure 3: Framework of our GPD-1 model for 2D scene forecasting and motion planning. Our model adapts the GPT-like architecture for autonomous driving scenarios with two key innovations: 1) a 2D map scene tokenizer that generates discrete high-level representations of the 2D BEV map, and 2) a hierarchical quantization agent tokenizer to encode agent information. Using a scene-level mask, the autoregressive transformer predicts future scenes by conditioning on ground-truth scene tokens during training and on previously predicted scene tokens during inference.
  • Figure 4: Visualizations of the Scene Generation Task across different types of scenarios.
  • Figure 5: Visualizations of the scene generation, traffic simulation, closed-loop simulation, and motion planning tasks.
  • ...and 1 more figure
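The thresholding described in the Figure 2 caption can be sketched as a simple bucketization: a sorted list of thresholds splits a continuous quantity into ranges, and the index of the range becomes the discrete token. The bin edges, the function names, and the choice to quantize x, y, and heading independently are hypothetical; the paper's actual thresholds and its hierarchical scheme are not reproduced here.

```python
from bisect import bisect_right
import math

# Hypothetical bin edges, chosen only for illustration.
POSITION_EDGES = [-50.0, -20.0, -5.0, 0.0, 5.0, 20.0, 50.0]  # metres
HEADING_EDGES = [-math.pi / 2, 0.0, math.pi / 2]              # radians


def bucketize(value, edges):
    """Return the index of the range that `value` falls into.

    With n sorted edges there are n + 1 ranges, indexed 0..n.
    """
    return bisect_right(edges, value)


def tokenize_agent(x, y, heading):
    """Convert a continuous agent state into a tuple of discrete bin indices."""
    return (
        bucketize(x, POSITION_EDGES),
        bucketize(y, POSITION_EDGES),
        bucketize(heading, HEADING_EDGES),
    )


print(tokenize_agent(3.0, -12.5, 0.3))  # (4, 2, 2)
```

Values outside the outermost thresholds fall into the first or last range, so every state receives a valid token.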