Learning to Drive via Asymmetric Self-Play

Chris Zhang, Sourav Biswas, Kelvin Wong, Kion Fallah, Lunjun Zhang, Dian Chen, Sergio Casas, Raquel Urtasun

TL;DR

This work proposes asymmetric self-play to scale driving-policy training beyond real data by adding challenging, solvable, and realistic synthetic scenarios. It pairs a teacher that learns to generate scenarios it can solve but the student cannot with a student that learns to solve them.

Abstract

Large-scale data is crucial for learning realistic and capable driving policies. However, it can be impractical to rely on scaling datasets with real data alone. The majority of driving data is uninteresting, and deliberately collecting new long-tail scenarios is expensive and unsafe. We propose asymmetric self-play to scale beyond real data with additional challenging, solvable, and realistic synthetic scenarios. Our approach pairs a teacher that learns to generate scenarios it can solve but the student cannot, with a student that learns to solve them. When applied to traffic simulation, we learn realistic policies with significantly fewer collisions in both nominal and long-tail scenarios. Our policies further zero-shot transfer to generate training data for end-to-end autonomy, significantly outperforming both state-of-the-art adversarial approaches and training on real data alone. For more information, visit https://waabi.ai/selfplay.

Paper Structure

This paper contains 51 sections, 4 theorems, 42 equations, 7 figures, 4 tables, and 2 algorithms.

Key Result

Lemma

If $\pi_T$ and $\pi_S$ are in equilibrium ($\pi_T$ cannot improve without changing $\pi_S$ and vice versa), then $R_T \leq 2 \beta I_\text{data}(\pi_{TS})$.

Figures (7)

  • Figure 1: Asymmetric Self-Play. The teacher (red, green) learns to generate realistic scenarios where the student (blue) makes a mistake (top) while demonstrating a solution itself (bottom). The two are jointly trained to continually solve more scenarios.
  • Figure 2: Method Overview. We sample an initial scene and designate adversarial actors at random. The teacher must control adversarial actors such that the student fails, but itself passes. Adversarial actions are replayed to keep the scenario the same.
  • Figure 3: Policy Architecture. We encode $K$ lane graph nodes and state history for $N$ actors over $H$ history timesteps into $D$-dimensional features. A transformer backbone with $M$ blocks uses factorized attention to extract features before decoding them into actor steering and acceleration. The teacher policy additionally encodes actor type (if an actor is in ${\mathcal{T}}$) and target information; the student does not observe this information.
  • Figure 4: Qualitative Comparison. We show TrafficSim (top) and Ours (bottom) on Argoverse2. Our method learns better interaction reasoning to avoid collisions realistically. Colored actors are controlled; gray actors are replayed.
  • Figure 5: (Left): When the student is training, adversarial success plateaus but the student continually improves. (Center): When the student is frozen, adversarial success improves along with teacher performance. (Right): Our approach dominates the Pareto frontier obtained from naively increasing collision loss weight.
  • ...and 2 more figures
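
The loop described in Figure 2 can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a one-dimensional car-following simulator in which a single adversarial lead car brakes in front of an ego car, and a binary teacher reward that is positive only when the student collides and the teacher does not. The names `designate_adversaries`, `collides`, `teacher_reward`, `student_ego`, and `teacher_ego`, the dynamics, and all constants are hypothetical, and the fixed `adv_accels` sequences stand in for adversarial actions that a learned teacher would produce.

```python
import random
from typing import Callable, List, Sequence

DT = 0.5  # toy simulation step in seconds (assumed value)

# Hypothetical ego policy signature: (gap, ego_speed, lead_speed) -> acceleration
EgoPolicy = Callable[[float, float, float], float]


def designate_adversaries(actor_ids: Sequence[int], k: int,
                          rng: random.Random) -> List[int]:
    """Designate k actors at random as adversaries (as in Figure 2)."""
    return sorted(rng.sample(list(actor_ids), k))


def collides(ego_policy: EgoPolicy, adv_accels: Sequence[float],
             gap: float = 10.0, ego_speed: float = 10.0,
             lead_speed: float = 10.0) -> bool:
    """Replay fixed adversarial (lead-car) accelerations against an ego policy.

    Replaying the same adversarial actions keeps the scenario identical
    for whichever ego policy is being evaluated.
    """
    for a_lead in adv_accels:
        a_ego = ego_policy(gap, ego_speed, lead_speed)
        ego_speed = max(0.0, ego_speed + a_ego * DT)
        lead_speed = max(0.0, lead_speed + a_lead * DT)
        gap += (lead_speed - ego_speed) * DT
        if gap <= 0.0:
            return True
    return False


def teacher_reward(teacher_collides: bool, student_collides: bool) -> float:
    """Assumed binary reward: the student fails while the teacher passes."""
    return 1.0 if (student_collides and not teacher_collides) else 0.0


def student_ego(gap: float, ego_speed: float, lead_speed: float) -> float:
    """Toy student: holds its speed and never brakes."""
    return 0.0


def teacher_ego(gap: float, ego_speed: float, lead_speed: float) -> float:
    """Toy teacher solution: brakes when closing in or when the gap is small."""
    return -8.0 if (lead_speed < ego_speed or gap < 8.0) else 0.0


hard_brake = [-8.0] * 6  # adversarial lead car brakes hard
cruise = [0.0] * 6       # lead car holds its speed

for name, adv in [("hard_brake", hard_brake), ("cruise", cruise)]:
    r = teacher_reward(collides(teacher_ego, adv), collides(student_ego, adv))
    print(name, r)
```

Under these assumptions, only the hard-braking scenario earns the teacher a reward: it is challenging (the toy student collides) yet solvable (the toy teacher avoids the collision). A scenario in which both policies collide earns nothing, which is what discourages unsolvable scenarios in this sketch.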

Theorems & Definitions (10)

  • Definition
  • Lemma
  • Proof
  • Theorem
  • Proof
  • Definition
  • Lemma
  • Proof
  • Theorem
  • Proof