
Human-compatible driving partners through data-regularized self-play reinforcement learning

Daphne Cornelisse, Eugene Vinitsky

TL;DR

This work proposes Human-Regularized PPO (HR-PPO), a multi-agent algorithm where agents are trained through self-play with a small penalty for deviating from a human reference policy, and finds that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios.

Abstract

A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting. To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent algorithm where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations. We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios. We open-source our code and trained agents at https://github.com/Emerge-Lab/nocturne_lab and provide demonstrations of agent behaviors at https://sites.google.com/view/driving-partners.
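The idea of "self-play with a small penalty for deviating from a human reference policy" can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the additive weighting via a hypothetical `reg_weight`, the direction of the KL term, and the discrete action distributions are all assumptions made for illustration. The released code at the repository above is the authoritative source.

```python
import numpy as np


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))


def kl_divergence(p, q, eps=1e-8):
    """Mean KL(p || q) over a batch of discrete distributions (last axis = actions)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1))


def human_regularized_loss(logp_new, logp_old, advantages,
                           policy_probs, human_probs, reg_weight=0.05):
    """PPO loss plus a small penalty for deviating from a human reference policy.

    Assumptions (not taken from the paper): the penalty is KL(human || policy)
    and is added with weight `reg_weight`.
    """
    rl_term = ppo_clip_loss(logp_new, logp_old, advantages)
    penalty = kl_divergence(human_probs, policy_probs)
    return rl_term + reg_weight * penalty


# Toy batch: two states, three discrete actions.
human_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
policy_probs = np.array([[0.4, 0.4, 0.2], [0.3, 0.4, 0.3]])
logp_old = np.log(np.array([0.5, 0.25]))
logp_new = np.log(np.array([0.5, 0.25]))
advantages = np.array([1.0, 0.5])

loss_matching = human_regularized_loss(
    logp_new, logp_old, advantages, human_probs, human_probs)
loss_deviating = human_regularized_loss(
    logp_new, logp_old, advantages, policy_probs, human_probs)
```

In this sketch the penalty vanishes when the policy matches the human reference, so `loss_matching` equals the plain PPO loss, while `loss_deviating` is larger. A small `reg_weight` keeps the objective RL-first, in the spirit of the abstract.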

Paper Structure (47 sections, 10 equations, 22 figures, 8 tables)

Figures (22)

  • Figure 1: LHS: A bird's eye view of an example scenario in the training dataset from the perspective of the green agent in the bottom center. RHS: Agents only have a partial view of the environment and must plan under uncertainty.
  • Figure 2: Overview of metrics used for evaluation. Left: Agents achieve their goal if they reach the target (color-coded circles) without collisions before the episode ends ($80$ steps). In this example, the goal rate is $1/3$ (only the yellow car reaches its goal), the off-road rate is $1/3$ (the green car hits a road edge) and the collision rate is $0$ (no vehicle crashes with another vehicle). Right: Realism metrics concern how agents navigate to their goal positions, that is, the extent to which the policy-generated trajectories (orange) resemble the logged human ones (green).
  • Figure 3: Goal-Conditioned Average Displacement Error (GC-ADE) to logged human driver positions against effectiveness metrics conditioned on knowing the goal. Policies are evaluated on the training dataset of 200 scenarios.
  • Figure 4: Steering MAE against effectiveness metrics.
  • Figure 5: Overall performance gap between evaluating in self-play vs. log-replay settings across the 200 training scenarios.
  • ...and 17 more figures
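
The effectiveness and realism metrics described in the Figure 2 and Figure 3 captions can be sketched as follows. This is a hedged illustration only: the function names and the per-agent boolean inputs are hypothetical, and the displacement function is a plain average displacement error rather than the paper's exact goal-conditioned (GC-ADE) computation.

```python
import numpy as np


def effectiveness_rates(reached_goal, went_off_road, collided):
    """Fraction of agents that reached their goal, left the road, or collided.

    Each argument is a per-agent boolean sequence for one scene (hypothetical format).
    """
    return {
        "goal_rate": float(np.mean(reached_goal)),
        "off_road_rate": float(np.mean(went_off_road)),
        "collision_rate": float(np.mean(collided)),
    }


def average_displacement_error(policy_xy, logged_xy):
    """Mean Euclidean distance between policy and logged positions, shape (T, 2)."""
    policy_xy = np.asarray(policy_xy, dtype=float)
    logged_xy = np.asarray(logged_xy, dtype=float)
    return float(np.mean(np.linalg.norm(policy_xy - logged_xy, axis=-1)))


# The three-agent example from the Figure 2 caption: only one car reaches its
# goal, one car hits a road edge, and no vehicle collides with another.
rates = effectiveness_rates(
    reached_goal=[True, False, False],
    went_off_road=[False, True, False],
    collided=[False, False, False],
)

# Toy two-step trajectory: the policy matches the log at step 0 and is 3 units off at step 1.
ade = average_displacement_error([[0.0, 0.0], [1.0, 0.0]],
                                 [[0.0, 0.0], [1.0, 3.0]])
```

For the caption's example this gives a goal rate and off-road rate of 1/3 each and a collision rate of 0; the toy trajectory yields an average displacement error of 1.5.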