Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving

Shai Shalev-Shwartz, Shaked Shammah, Amnon Shashua

TL;DR

Problem: enabling safe, long-horizon autonomous driving in multi-agent urban settings with rare accident events. Approach: policy gradient methods without Markov assumptions; Desires-trajectory decomposition with hard safety constraints; Option Graph for hierarchical temporal abstraction to reduce horizon and variance. Contributions: theoretical demonstration of non-Markov policy gradients, a safety-centric architecture, and a practical DAG-based option framework demonstrated on a double-merge task. Impact: provides a principled, scalable path to integrate learning with stringent safety guarantees in autonomous driving.

Abstract

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns, and pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield an overly simplistic policy. Moreover, one must account for unexpected behavior of other drivers and pedestrians while not being so defensive that normal traffic flow is disrupted. In this paper we apply deep reinforcement learning to the problem of forming long-term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. First is the need to ensure functional safety, something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of the unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while the hard constraints guarantee the safety of driving. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further.

Paper Structure

This paper contains 12 sections, 5 theorems, 18 equations, 2 figures.

Key Result

Theorem 1

Let $\hat{\nabla}(\bar{s})$ denote the policy gradient estimate computed from a sampled trajectory $\bar{s}$. Then $\mathbb{E}_{\bar{s} \sim P_\theta} \hat{\nabla}(\bar{s}) = \nabla \mathbb{E}_{\bar{s} \sim P_\theta} [R(\bar{s})]$.
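The identity in Theorem 1 can be checked numerically on a toy problem. The sketch below assumes the standard score-function (REINFORCE) form of the estimator, $\hat{\nabla}(\bar{s}) = R(\bar{s}) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which the statement above does not spell out. The environment, the reward, and every function name are hypothetical and chosen only for illustration. The observed state depends on the whole action history, so the process is deliberately non-Markov, and all trajectories are enumerated so that the expectation is exact rather than sampled.

```python
import itertools
import math

T = 3  # horizon: number of decisions per trajectory (illustrative choice)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noise_prob(bit, past_actions):
    # Hypothetical "other agent": its behaviour depends on the parity of the
    # whole action history, not on the current state. It does not depend on theta.
    p_one = 0.2 + 0.6 * (sum(past_actions) % 2)
    return p_one if bit == 1 else 1.0 - p_one


def observe(past_actions, noise_so_far):
    # Observed state: a function of the full history (non-Markov by design).
    return 1.0 + sum(past_actions) - 0.5 * sum(noise_so_far)


def reward(actions, noise):
    # Arbitrary trajectory-level reward R(s_bar).
    return float(sum(actions)) - 2.0 * actions[0] * noise[-1]


def enumerate_trajectories(theta):
    """Yield (probability, reward, summed score) for every trajectory."""
    for actions in itertools.product([0, 1], repeat=T):
        for noise in itertools.product([0, 1], repeat=T):
            prob, score = 1.0, 0.0
            for t in range(T):
                s = observe(actions[:t], noise[:t + 1])
                p1 = sigmoid(theta * s)  # pi_theta(a = 1 | s)
                prob *= noise_prob(noise[t], actions[:t])
                prob *= p1 if actions[t] == 1 else 1.0 - p1
                # d/dtheta log pi_theta(a | s) for a logistic policy
                score += s * (actions[t] - p1)
            yield prob, reward(actions, noise), score


def expected_reward(theta):
    return sum(p * r for p, r, _ in enumerate_trajectories(theta))


def expected_estimate(theta):
    # Left-hand side of Theorem 1: E[ R(s_bar) * sum_t grad log pi ]
    return sum(p * r * g for p, r, g in enumerate_trajectories(theta))


def finite_difference(theta, h=1e-5):
    # Right-hand side of Theorem 1, approximated by central differences.
    return (expected_reward(theta + h) - expected_reward(theta - h)) / (2 * h)


print(expected_estimate(0.3), finite_difference(0.3))
```

The two printed numbers should agree up to finite-difference error. The check works because the history-dependent factors in the trajectory probability do not involve $\theta$, so they drop out of the gradient of the log-probability.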

Figures (2)

  • Figure 1: The double merge scenario. Vehicles arrive from the left or right side to the merge area. Some vehicles should continue on their road while other vehicles should merge to the other side. In dense traffic, vehicles must negotiate the right of way.
  • Figure 2: An options graph for the double merge scenario.
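As a rough illustration of what traversing an options graph could look like, the sketch below models it as a directed acyclic graph in which each internal node chooses one child and a leaf stands for the final decision. It is an assumption-laden toy, not the graph from Figure 2: the node names, the observation keys (`wants_merge`, `gap_m`), and the hand-written decision rules are all hypothetical, and the gating mechanism mentioned in the abstract is not modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Observation = Dict[str, float]


@dataclass
class OptionNode:
    name: str
    children: List["OptionNode"] = field(default_factory=list)
    # Hypothetical per-node policy: maps an observation to a child index.
    # In a learned system this would be a trained policy, not a fixed rule.
    choose: Callable[[Observation], int] = lambda obs: 0


def traverse(root: OptionNode, obs: Observation) -> List[str]:
    """Walk from the root to a leaf, returning the names of visited nodes."""
    path = [root.name]
    node = root
    while node.children:
        node = node.children[node.choose(obs)]
        path.append(node.name)
    return path


# Hypothetical three-level graph: root -> {stay, merge}, merge -> {go, yield}.
go = OptionNode("go")
give_way = OptionNode("yield")
stay = OptionNode("stay")
merge = OptionNode(
    "merge",
    children=[go, give_way],
    choose=lambda obs: 0 if obs["gap_m"] > 10.0 else 1,
)
root = OptionNode(
    "root",
    children=[stay, merge],
    choose=lambda obs: 1 if obs["wants_merge"] > 0.5 else 0,
)

print(traverse(root, {"wants_merge": 1.0, "gap_m": 12.0}))
```

Each decision here takes one short walk down the graph instead of a long sequence of low-level actions, which is the intuition behind using temporal abstraction to shorten the effective horizon.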

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4