GPD-1: Generative Pre-training for Driving
Zixun Xie, Sicheng Zuo, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, Jie Zhou, Jiwen Lu, Shanghang Zhang
TL;DR
GPD-1 is a unified generative pre-training framework for driving. It encodes each scene as tokens for the ego vehicle, surrounding agents, and the map, and uses a GPT-like transformer with a scene-level attention mask to forecast how the scene evolves. A VQ-VAE-based tokenizer for 2D BEV maps and a hierarchical agent tokenizer produce the discrete tokens, enabling efficient multi-frame prediction without task-specific fine-tuning. The model is pre-trained on nuPlan with a two-stage training regime and shows zero-shot and few-shot generalization to scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning, with competitive results and robust behavior in complex scenarios. The result is an interpretable, token-based, data-driven framework for integrated scene understanding and planning, with practical uses in realistic simulation and downstream decision-making.
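To make the VQ-VAE-based map tokenization concrete, the following is a minimal sketch of the generic vector-quantization step: each continuous latent vector is replaced by the index of its nearest codebook entry. It is not the GPD-1 implementation. The function name `quantize`, the toy 2-dimensional codebook, and the example latents are all assumptions chosen for illustration; in a real model the latents would come from a learned encoder over BEV map patches.

```python
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector (N, D) to the index of its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The discrete token is the index of the closest code.
    return dists.argmin(axis=1)

# Hypothetical toy codebook with K=3 entries of dimension D=2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
# Hypothetical encoder outputs for three map patches.
latents = np.array([[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]])

tokens = quantize(latents, codebook)
print(tokens)  # [1 0 2]
```

The resulting integer indices are what a token-based transformer would consume in place of the raw map; decoding looks up `codebook[tokens]` and passes the result to a decoder.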
Abstract
Modeling the evolution of driving scenarios is important for the evaluation and decision-making of autonomous driving systems. Most existing methods focus on a single aspect of scene evolution, such as map generation, motion prediction, or trajectory planning. In this paper, we propose a unified Generative Pre-training for Driving (GPD-1) model that accomplishes all of these tasks together without additional fine-tuning. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. We adopt the autoregressive transformer architecture and use a scene-level attention mask to enable intra-scene bi-directional interactions. For the ego and agent tokens, we propose a hierarchical positional tokenizer to effectively encode both 2D positions and headings. For the map tokens, we train a map vector-quantized autoencoder to efficiently compress ego-centric semantic maps into discrete tokens. We pre-train GPD-1 on the large-scale nuPlan dataset and conduct extensive experiments to evaluate its effectiveness. With different prompts, GPD-1 successfully generalizes to various tasks without fine-tuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning. Code: https://github.com/wzzheng/GPD.
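One way to read the scene-level attention mask is as a block-causal mask: tokens within the same scene (frame) attend to each other in both directions, while attention across scenes stays causal. The sketch below builds such a mask under the simplifying assumption that every frame contributes the same number of tokens. The function name `scene_level_mask` and its parameters are hypothetical and are not taken from the GPD-1 code.

```python
import numpy as np

def scene_level_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    # Frame index of every token in the flattened sequence, e.g. [0, 0, 1, 1, 2, 2].
    frame_ids = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A token may attend to any token in its own frame or in an earlier frame.
    return frame_ids[:, None] >= frame_ids[None, :]

mask = scene_level_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Compared with a standard token-level causal mask, the only difference is the square blocks of ones on the diagonal, which let ego, agent, and map tokens of the same frame interact freely.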
