Building reliable sim driving agents by scaling self-play

Daphne Cornelisse, Aarav Pandya, Kevin Joseph, Joseph Suárez, Eugene Vinitsky

TL;DR

The paper tackles the reliability gap in simulation driving agents by scaling self-play reinforcement learning on a large, real-world driving dataset within a semi-realistic perception framework. Using a GPU-accelerated, data-driven multi-agent simulator and a decentralized PPO setup, the authors demonstrate near-perfect task completion with very low collision and off-road rates across 10k held-out scenes, and show strong generalization when training data is abundant. They also reveal limitations in rare or out-of-distribution scenarios and illustrate rapid adaptation through fine-tuning on small hand-designed sets. By open-sourcing the pre-trained agents and integrating them into a batched simulator, the work provides a practical pathway for scalable, reliable AV simulation and evaluation. The findings have broad implications for safe, automated AV development pipelines and potential extensions to other agent-based modeling domains.

Abstract

Simulation agents are essential for designing and testing systems that interact with humans, such as autonomous vehicles (AVs). These agents serve various purposes, from benchmarking AV performance to stress-testing system limits, but all applications share one key requirement: reliability. To enable sound experimentation, a simulation agent must behave as intended. It should minimize actions that may lead to undesired outcomes, such as collisions, which can distort the signal-to-noise ratio in analyses. As a foundation for reliable sim agents, we propose scaling self-play to thousands of scenarios on the Waymo Open Motion Dataset under semi-realistic limits on human perception and control. Training from scratch on a single GPU, our agents solve almost the full training set within a day. They generalize to unseen test scenes, achieving a 99.8% goal completion rate with less than 0.8% combined collision and off-road incidents across 10,000 held-out scenarios. Beyond in-distribution generalization, our agents show partial robustness to out-of-distribution scenes and can be fine-tuned in minutes to reach near-perfect performance in such cases. We open-source the pre-trained agents and integrate them with a batched multi-agent simulator. Demonstrations of agent behaviors can be viewed at https://sites.google.com/view/reliable-sim-agents, and we open-source our agents at https://github.com/Emerge-Lab/gpudrive.

Paper Structure

This paper contains 54 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of approach. Left: We define several criteria to guide the learning of simulation agents through rewards. The reward function is a weighted combination of these criteria: $r(o^i_t) = \sum_i c_i \cdot \mathbb{I}[\text{criteria}_i]$. Here, we focus on achieving goal-directed nominal sim agent behavior: agents should stay on the road and avoid collisions while navigating to a target position. Right: Over 24 hours on a single GPU, we iterate through 10,000 scenarios (green curve) from the Waymo Open Motion Dataset in GPUDrive (Kazemkhani et al., 2024), reaching near-perfect performance (blue curve, reliability) on the defined criteria after 2 billion agent steps of self-play PPO. The example scenarios illustrate agent behavior at different stages of training. Initially, agents display random behavior and frequently collide with each other and the road edges (marked in orange and red), but their behavior becomes streamlined over many iterations.
  • Figure 2: Sample scenario state with corresponding agent observation. Left: Example scenario from the Waymo Open Motion Dataset rendered in GPUDrive, shown from a bird's eye view. The boxes ($\textcolor{RoyalBlue}{\hrectangle}$) indicate controlled agents and the circles ($\textcolor{RoyalBlue}{\odot}$) indicate the goal positions for every controlled agent. Right: Scene view from the agent in the center ($\textcolor{Periwinkle}{\hrectangle}$). Agents see a subset of the road points within a configurable radius (here $r_o = 50$ meters) and their corresponding types and segment lengths. Road types are road edges ($\textcolor{Black}{\bullet}$) and road lanes ($\textcolor{Gray}{\bullet}$). Agents can also view the relative position and velocity of the other agents in the scene ($\textcolor{YellowOrange}{\hrectangle}$). Agents in gray are parked cars and remain static throughout the episode, but this information is not visible to the agent: it does not know that the gray cars are guaranteed not to move. Consequently, all cars are orange in the agent observation view.
  • Figure 3: Network architecture. The relative observation vector $o_t^i$ is first decomposed into its separate modalities: the ego state (i.e. the agent's information about itself and its goals), the visible portion of the road graph, and the speeds, yaws, and relative positions of the other agents in the scene. These modalities are first processed separately. Their outputs are combined and max pooled, then processed together. The hidden layer is finally fed into an actor and a critic head.
  • Figure 4: Scaling with data. Average performance with standard errors on 10,000 unseen scenarios from the WOMD validation set as a function of the training dataset size. The striped lines indicate optimal performance.
  • Figure 5: Batch performance throughout training. Left: Average reward per agent (maximum of 1) as a function of wall-clock time. We train agents for at most 24 hours. Center: Goal achievement rate per batch as a function of global steps (2 billion steps generated in 24 hours). Right: Percentage of agents that collide with another agent (red) or with a road edge (orange). All curves are smoothed using a rolling window of 250 steps. The inset figures show a zoomed-in view of the final four hours of the run, with the y-axes displayed on a logarithmic scale. The red annotations on the insets indicate the minimum and maximum values within the zoomed-in window. Note that the metrics reported during training exclude trivial agents: we only control agents that have to drive more than 2 meters to reach their goal destination.
  • ...and 4 more figures
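
The weighted-indicator reward described in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the criterion names (`goal_reached`, `collided`, `off_road`), and the weight values are all hypothetical, since the caption gives only the general form $\sum_i c_i \cdot \mathbb{I}[\text{criteria}_i]$ and not the weights themselves.

```python
from typing import Mapping

# Hypothetical weights c_i, chosen only for illustration.
# The caption does not state the actual values used in the paper.
EXAMPLE_WEIGHTS = {
    "goal_reached": 1.0,   # positive weight for reaching the target position
    "collided": -0.5,      # penalty for colliding with another agent
    "off_road": -0.5,      # penalty for crossing a road edge
}


def weighted_indicator_reward(
    criteria_met: Mapping[str, bool],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Return the sum of c_i * 1[criterion_i] over all weighted criteria.

    Criteria missing from `criteria_met` are treated as not satisfied.
    """
    return sum(
        c * float(criteria_met.get(name, False))
        for name, c in weights.items()
    )


# One agent at one step: it reached its goal but also hit another agent.
step_reward = weighted_indicator_reward({"goal_reached": True, "collided": True})
```

Keeping the weights in a mapping separate from the indicator values mirrors the caption's framing: different sim-agent behaviors could, in principle, be targeted by changing the criteria and their weights without touching the summation.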