Solving Motion Planning Tasks with a Scalable Generative Model

Yihan Hu, Siqi Chai, Zhening Yang, Jingyu Qian, Kun Li, Wenxin Shao, Haichao Zhang, Wei Xu, Qiang Liu

TL;DR

GUMP introduces a scalable, unified generative world model for autonomous driving that learns driving scene dynamics to enable data generation, closed-loop simulation, planning evaluation, and online RL. It couples a simple key-value tokenizer with a Multimodal Causal Transformer and prediction chunking to support long-horizon generation and efficient inference via partial-AR decoding. The model achieves state-of-the-art results on Waymo Sim Agents and nuPlan planning benchmarks while enabling a flexible online training framework, positioning GUMP as a foundation model for motion planning tasks. The work demonstrates significant scalability and broad applicability, with identified future improvements including quantization, vectorized maps, and sensor integration.

Abstract

As autonomous driving systems are deployed to millions of vehicles, there is a pressing need to improve their scalability and safety while reducing engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desirable. In this paper, we present an efficient solution based on a generative model that learns the dynamics of driving scenes. With this model, we can not only simulate diverse futures of a given driving scenario but also generate a variety of driving scenarios conditioned on various prompts. Our design allows the model to operate in both full-autoregressive and partial-autoregressive modes, significantly improving inference and training speed without sacrificing generative capability. This efficiency makes it well suited for use as an online reactive environment for reinforcement learning, an evaluator for planning policies, and a high-fidelity simulator for testing. We evaluated our model on two real-world datasets: the Waymo motion dataset and the nuPlan dataset. On the simulation realism and scene generation benchmarks, our model achieves state-of-the-art performance, and on the planning benchmarks, our planner outperforms prior art. We conclude that the proposed generative model may serve as a foundation for a variety of motion planning tasks, including data generation, simulation, planning, and online training. Source code is publicly available at https://github.com/HorizonRobotics/GUMP/

Paper Structure

This paper contains 63 sections, 14 equations, 12 figures, and 13 tables.

Figures (12)

  • Figure 1: We aim to provide a generative model as the central unit that supports all learning-based motion planning tasks in the autonomous driving domain. We categorize the tasks into four sub-domains, distinguished in the diagram by color: green for data generation, blue for model evaluation, purple for model training, and orange for model inference. Our approach covers both offboard applications (the first three sub-domains) and an onboard application (the last sub-domain). Specifically, scene generation targets data generation that produces specific traffic scenarios from context information such as high-definition maps or user prompts. Reactive simulation targets a closed-loop evaluator that provides realistic, human-like agents that respond to the behavior of the ego vehicle and its environment. Online training targets a closed-loop training module that lets the learned policy interact with the environment, collect rewards, and perform back-propagation. Lastly, interactive planning aims to enhance an onboard planner by unrolling futures in parallel to search for the trajectory that achieves the highest reward.
  • Figure 2: GUMP is composed of four parts: a raster encoder that encodes static information, including the map, route, and static objects; a key-value pair tokenizer that discretizes dynamic information, including the states of road users and traffic lights; a Multimodal Causal Transformer (MCT) that autoregressively predicts the next latent embedding based on key queries; and a decoder that samples the probabilistic features and decodes them into future scenarios.
  • Figure 3: Prediction Chunking and Temporal Aggregation. $s_t^{p}$ denotes the per-agent state at time step $t$ and time index $p$ within the chunking data. In our context, $s_t^{p}$ comprises the set $\{x, y, \theta\}$.
  • Figure 4: This figure compares the full-AR mode with the partial-AR mode. Here, $a_{t}^i$ represents the state of the $i^{th}$ agent at time $t$. Employing a GRU decoder alongside prediction chunking, we can simultaneously predict the next state of each agent, denoted as $\hat{a}_{t+1}^{i}$. These predictions serve as surrogate conditions to bypass intra-frame sequential dependencies, markedly speeding up the process by eliminating the need for an intra-frame sequential AR procedure.
  • Figure 5: GUMP serves as a central unit, bridging offline datasets with downstream applications. By learning from the collected offline data, we utilize a generative framework to produce a vast amount of affordable, interactive data, which benefits various downstream tasks such as scene generation, reactive simulation, planning, and online training.
  • ...and 7 more figures
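
The prediction chunking and temporal aggregation idea named in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes that $s_t^{p}$ is the state $(x, y, \theta)$ predicted at step $t$ for time $t + p$, that overlapping predictions for the same target time are combined by a weighted average, and that more recent predictions (smaller $p$) receive exponentially larger weights. The function name `aggregate_chunks` and the `decay` parameter are hypothetical, and the weighting scheme is an assumption rather than something the captions specify.

```python
import math


def aggregate_chunks(chunks, decay=0.5):
    """Combine overlapping chunked predictions for one agent.

    chunks[t][p] is an (x, y, theta) tuple predicted at step t for time t + p
    (an assumed reading of s_t^p). Returns {target_time: (x, y, theta)}.
    The exponential weighting exp(-decay * p) is an illustrative assumption.
    """
    sums = {}
    for t, chunk in enumerate(chunks):
        for p, (x, y, theta) in enumerate(chunk):
            w = math.exp(-decay * p)  # smaller p = fresher prediction = larger weight
            acc = sums.setdefault(t + p, [0.0, 0.0, 0.0, 0.0, 0.0])
            acc[0] += w * x
            acc[1] += w * y
            # Average headings on the unit circle so angles near +pi and -pi
            # do not cancel to zero.
            acc[2] += w * math.sin(theta)
            acc[3] += w * math.cos(theta)
            acc[4] += w
    return {
        tau: (sx / w, sy / w, math.atan2(ss, sc))
        for tau, (sx, sy, ss, sc, w) in sorted(sums.items())
    }


# Usage: two consecutive steps, each predicting a chunk of three future states.
chunks = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # predicted at t = 0
    [(1.2, 0.0, 0.0), (2.2, 0.0, 0.0), (3.2, 0.0, 0.0)],  # predicted at t = 1
]
aggregated = aggregate_chunks(chunks, decay=0.5)
```

With `decay=0` the sketch reduces to a plain average of all predictions that target the same time. A larger `decay` trusts the most recent chunk more.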