TorchDriveEnv: A Reinforcement Learning Benchmark for Autonomous Driving with Reactive, Realistic, and Diverse Non-Playable Characters

Jonathan Wilder Lavington; Ke Zhang; Vasileios Lioutas; Matthew Niedoba; Yunpeng Liu; Dylan Green; Saeid Naderiparizi; Xiaoxuan Liang; Setareh Dabiri; Adam Ścibior; Berend Zwartsenberg; Frank Wood

TL;DR

The work addresses the need for realistic, efficient, and adaptable simulators to train autonomous driving controllers under varied NPC behaviors. It introduces TorchDriveSim, a differentiable 2D driving simulator, and TorchDriveEnv, a Gym-compatible RL benchmark that integrates data-driven, reactive NPCs via an external API, with CARLA-based maps and train/validation splits. Evaluations of common RL baselines (SAC, PPO, A2C, TD3) reveal that multi-agent training is more challenging yet yields better generalization, while even strong policies incur infractions, underscoring the need for objective-aligned optimization. Overall, TorchDriveSim/Env provide a practical, extensible framework for robust AV controller development and pave the way for more realistic NPC behavior modeling and differentiable dynamics.

Abstract

The training, testing, and deployment of autonomous vehicles require realistic and efficient simulators. Moreover, because the problems posed by different autonomous systems vary widely, these simulators need to be easy to use and easy to modify. To address these problems we introduce TorchDriveSim and its benchmark extension TorchDriveEnv. TorchDriveEnv is a lightweight reinforcement learning benchmark programmed entirely in Python, which can be modified to test a number of different factors in learned vehicle behavior, including the effect of varying kinematic models, agent types, and traffic control patterns. Most importantly, unlike many replay-based simulation approaches, TorchDriveEnv is fully integrated with a state-of-the-art behavioral simulation API. This allows users to train and evaluate driving models alongside data-driven Non-Playable Characters (NPCs) whose initializations and driving behavior are reactive, realistic, and diverse. We illustrate the efficiency and simplicity of TorchDriveEnv by evaluating common reinforcement learning baselines in both training and validation environments. Our experiments show that TorchDriveEnv is easy to use, but difficult to solve.

Paper Structure (13 sections, 1 equation, 4 figures, 1 table)

Introduction
Related Work
Design and Features
Simulator
Agents
Generative Models of Vehicle Behaviour
Environment
Action Space
Observation Space
End-Conditions
Reward
Benchmark
Discussion

Figures (4)

Figure 1: Frames from five stochastic $\texttt{TorchDriveEnv}$ episodes encountered while training a SAC (Haarnoja et al., 2018) RL agent. In all examples, the ego vehicle is red, the next waypoint is a green circle, and non-playable character (NPC) vehicles are blue. Drivable surfaces are grey, while traffic lights are denoted by thin coloured rectangles at stop-lines that are red, yellow, or green. Lanes are indicated by purple, dark green, and white lines for right, left, or overlapping lane boundaries. In each example, five uniformly spaced frames were taken from a single episode and are displayed sequentially from left to right. These examples illustrate the diverse traffic and road conditions encountered during training. Note the high density and realistic behaviour displayed by the NPCs, particularly in the crowded scenes.
Figure 2: Stochastic initializations of validation scenarios used to test learned agents' out-of-distribution performance. Each scenario tests a different set of agent capabilities. The first scenario (a) tests an agent's ability to drive around a parked vehicle obstructing the road. The scenario in (b) tests an agent's ability to navigate a three-way intersection and yield as needed to cross traffic. The example in (c) requires the agent to avoid a collision with an oncoming car that is passing in the agent's lane. The roundabout in (d) requires the agent to merge, yield, change lanes, and finally exit the roundabout. The last example in (e) tests whether the agent can navigate through controlled intersections in the presence of NPCs that obey traffic lights realistically.
Figure 3: Two stochastic initializations drawn from each of two separate training scenarios. The first pair of panels shows initializations over the entire map of one scenario, while the second pair provides a similar visualization of a specific intersection in a separate scenario. While certain distributional characteristics are similar (e.g. cars are stopped at stop-lights), the positions and even the initial velocities of the vehicles differ between initializations. In all scenes, green dots indicate the sequence of waypoints provided to the agent, the red rectangle indicates the ego vehicle, blue rectangles indicate NPCs, and an orange arrow indicates the direction of each vehicle. Traffic lights are indicated by coloured bars, which are red, yellow, or green. Notice that all NPCs in each of the initializations are travelling in the correct direction and conform to the traffic light state.
Figure 4: The $\texttt{TorchDriveEnv}$ benchmark: training curves for multi-agent and single-agent $\texttt{TorchDriveEnv}$ environments across each of their respective training and validation scenarios. Collision indicates the percentage of episodes that end in a collision, Traffic-Light indicates the percentage of episodes that end in a traffic light violation, Waypoint # indicates the average number of waypoints reached during an episode, Offroad indicates the percentage of episodes that end in an off-road infraction, Return indicates the average cumulative reward per episode, and Horizon indicates the average episode length. The dotted and dashed lines indicate the average performance across 4 randomly seeded RL agents evaluated on 10 episodes each: the dotted line for agents trained in a multi-agent environment, and the dashed line for agents trained in an ego-only environment. The plots illustrate two points. First, learning in a multi-agent environment is more difficult than in a single-agent environment, as shown by significantly lower return and horizon metrics. Second, the performance drop for multi-agent-trained models evaluated in a single-agent environment is generally much smaller (see the horizon and return plots) than for single-agent-trained models evaluated in a multi-agent setting. Across all benchmarks, agents trained in a multi-agent environment survive longer (larger horizon) than agents trained in an ego-only environment.
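
The six quantities named in the Figure 4 caption can be computed from per-episode records roughly as follows. This is a minimal sketch and not the benchmark's own evaluation code: the `Episode` record, its field names, and the `end_reason` labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    end_reason: str  # assumed labels: "collision", "traffic_light", "offroad", "timeout"
    waypoints: int   # number of waypoints reached in the episode
    ret: float       # cumulative reward over the episode
    horizon: int     # number of steps before the episode ended


def summarize(episodes):
    """Aggregate episode records into the metrics described in the caption."""
    n = len(episodes)

    def pct(reason):
        # Percentage of episodes that ended for the given reason.
        return 100.0 * sum(e.end_reason == reason for e in episodes) / n

    return {
        "collision_pct": pct("collision"),
        "traffic_light_pct": pct("traffic_light"),
        "offroad_pct": pct("offroad"),
        "waypoints": mean(e.waypoints for e in episodes),
        "return": mean(e.ret for e in episodes),
        "horizon": mean(e.horizon for e in episodes),
    }
```

Under the protocol described in the caption, `summarize` would be applied to the pooled evaluation episodes of the 4 seeded agents (10 episodes each) to obtain one dotted or dashed reference value per metric.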
