DriveGPT: Scaling Autoregressive Behavior Models for Driving

Xin Huang, Eric M. Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S. Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa

TL;DR

DriveGPT investigates scaling laws for autoregressive behavior models in autonomous driving by systematically varying data size, model capacity, and compute. It employs a transformer encoder to fuse scene context and an LLM-style autoregressive decoder to generate future agent trajectories as Verlet actions, trained on massive driving data and evaluated in planning, prediction, and closed-loop settings. Key findings show data scaling as the primary bottleneck, with model and compute scaling providing diminishing returns beyond certain points, and the autoregressive decoder delivering robust planning performance and competitive motion prediction after pretraining. The work demonstrates real-time viability for closed-loop driving and offers practical guidance on scaling strategies for safer, more robust autonomous driving systems.
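The summary does not define Verlet actions. One common reading, borrowed from Verlet integration and assumed here rather than taken from the paper, is that each action corrects a constant-velocity extrapolation: the next position equals the current position, plus the last displacement, plus the action. The sketch below unrolls a sequence of such actions into positions; the function name `unroll_verlet` and the sample values are illustrative only.

```python
import numpy as np

def unroll_verlet(prev_pos, curr_pos, actions):
    """Unroll Verlet-style actions into future positions.

    Assumed update (not confirmed by the source):
        next = curr + (curr - prev) + action
    so a zero action continues at constant velocity.
    """
    prev = np.asarray(prev_pos, dtype=float)
    curr = np.asarray(curr_pos, dtype=float)
    positions = []
    for action in np.asarray(actions, dtype=float):
        nxt = curr + (curr - prev) + action
        positions.append(nxt)
        prev, curr = curr, nxt  # slide the two-step window forward
    return np.array(positions)

# Agent moving 1 unit per tick along x; zero actions keep that velocity.
straight = unroll_verlet([0.0, 0.0], [1.0, 0.0], np.zeros((3, 2)))

# A single lateral nudge at the first tick persists as lateral velocity.
nudged = unroll_verlet(
    [0.0, 0.0], [1.0, 0.0],
    [[0.0, 0.5], [0.0, 0.0], [0.0, 0.0]],
)
```

Under this assumed form, the "do nothing" token maps to smooth continued motion, which is one plausible reason to prefer it over predicting raw position deltas.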

Abstract

We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision-making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters, and compute. We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples, including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms state-of-the-art baselines and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling.
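To make "predict future agent states as tokens in an autoregressive fashion" concrete, the toy loop below decodes one token at a time and feeds each choice back as input. The `toy_logits` function is a hypothetical stand-in for the trained transformer, and the vocabulary size and greedy decoding are assumptions for illustration, not details from the paper.

```python
import numpy as np

VOCAB_SIZE = 5  # hypothetical number of discrete action tokens

def toy_logits(history):
    """Stand-in for a trained decoder: favors the token after the last one."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(history[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()

def decode(logits_fn, start_token, num_steps):
    """Greedy autoregressive decoding: each step conditions on all prior tokens."""
    tokens = [start_token]
    for _ in range(num_steps):
        probs = softmax(logits_fn(tokens))    # discrete distribution over next token
        tokens.append(int(np.argmax(probs)))  # pick the most likely token
    return tokens[1:]

future_tokens = decode(toy_logits, start_token=0, num_steps=4)
```

Replacing `np.argmax` with sampling from `probs` would yield multiple distinct rollouts from the same context, which is how an autoregressive decoder can represent several possible futures.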

Paper Structure

This paper contains 43 sections, 2 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Through data and model scaling, DriveGPT (red) handles complex real-world driving scenarios, such as lane changing in heavy traffic and yielding to a cyclist in the opposite lane, compared to a smaller baseline trained on less data (pink).
  • Figure 2: DriveGPT architecture, including a transformer encoder and a transformer decoder. The transformer encoder summarizes relevant scene context, such as target agent history, nearby agent history, and map information, into a set of scene embedding tokens. The transformer decoder follows an LLM-style architecture that takes a sequence of agent states as input and predicts a discrete distribution of actions at the next tick, conditioned on previous states.
  • Figure 3: Performance increases with dataset size across a range of model parameters, indicating that data is a limiting factor. Both axes are on a log scale. A linear fit in log-log space (that is, a power law) was applied to all data points except for the 1.5M curve, resulting in the following relationship: $\log(L) = -0.102 \log(D) + 2.663$ with an $R^2$ value of 0.986, where $L$ is the validation loss and $D$ is the number of unique training samples.
  • Figure 4: Model scaling is more effective as training data increases. The validation loss improves up to $\sim$100M parameters when trained on the full dataset.
  • Figure 5: Relationship between (smoothed) training loss and FLOPs. Each curve represents an experiment corresponding to a specific model size, and the "min bound" indicates the best performance possible for a given FLOP budget.
  • ...and 12 more figures
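
The fit reported in Figure 3 is linear in log-log space, so it corresponds to a power law $L \propto D^{-0.102}$. The caption does not state the logarithm base, but ratios of predicted losses do not depend on it. The sketch below computes the loss ratio implied by scaling the dataset and shows how such a line could be recovered with `numpy.polyfit`. The function names and data points are hypothetical, and the synthetic data assumes natural logarithms; this is not the authors' fitting code.

```python
import numpy as np

SLOPE = -0.102      # from the Figure 3 caption
INTERCEPT = 2.663   # from the Figure 3 caption

def loss_ratio(data_multiplier, slope=SLOPE):
    """Ratio L(k*D) / L(D) under L proportional to D**slope (base-independent)."""
    return data_multiplier ** slope

def fit_log_log(dataset_sizes, losses):
    """Least-squares line through (log D, log L); returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(losses), deg=1)
    return slope, intercept

ratio_10x = loss_ratio(10.0)

# Synthetic points generated from the reported line, assuming natural logs.
sizes = np.array([1e6, 1e7, 1e8])
synthetic_losses = np.exp(INTERCEPT) * sizes ** SLOPE
fitted_slope, fitted_intercept = fit_log_log(sizes, synthetic_losses)
```

Under the reported slope, each tenfold increase in unique training samples multiplies the predicted validation loss by roughly 0.79.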