Learning Safety Constraints from Demonstrations with Unknown Rewards

David Lindner, Xin Chen, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause

TL;DR

The paper proposes Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions.

Abstract

We propose Convex Constraint Learning for Reinforcement Learning (CoCoRL), a novel approach for inferring shared constraints in a Constrained Markov Decision Process (CMDP) from a set of safe demonstrations with possibly different reward functions. While previous work is limited to demonstrations with known rewards or fully known environment dynamics, CoCoRL can learn constraints from demonstrations with different unknown rewards without knowledge of the environment dynamics. CoCoRL constructs a convex safe set based on demonstrations, which provably guarantees safety even for potentially sub-optimal (but safe) demonstrations. For near-optimal demonstrations, CoCoRL converges to the true safe set with no policy regret. We evaluate CoCoRL in gridworld environments and a driving simulation with multiple constraints. CoCoRL learns constraints that lead to safe driving behavior. Importantly, we can safely transfer the learned constraints to different tasks and environments. In contrast, alternative methods based on Inverse Reinforcement Learning (IRL) often exhibit poor performance and learn unsafe policies.
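The convexity idea behind the safe set can be illustrated with a small sketch. It assumes, as an illustration only, that each policy is summarized by its discounted feature expectations and that the hidden constraint is linear in them (a cost vector `phi` with threshold `xi`). All names and numbers below are hypothetical and not taken from the paper's code. Under these assumptions, any convex mixture of safe demonstrations is also safe, even though the learner never observes `phi` or `xi`.

```python
from typing import List

Vector = List[float]


def discounted_feature_expectation(trajectories: List[List[Vector]], gamma: float) -> Vector:
    """Monte Carlo estimate of E[sum_t gamma^t f(s_t, a_t)].

    Each trajectory is a list of per-step feature vectors (hypothetical format).
    """
    dim = len(trajectories[0][0])
    total = [0.0] * dim
    for trajectory in trajectories:
        for t, features in enumerate(trajectory):
            weight = gamma ** t
            for k in range(dim):
                total[k] += weight * features[k]
    return [x / len(trajectories) for x in total]


def mix(points: List[Vector], weights: List[float]) -> Vector:
    """Convex combination of feature expectations."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) for k in range(dim)]


def cost(phi: Vector, f: Vector) -> float:
    """Linear constraint cost phi . f (linearity is an assumption of this sketch)."""
    return sum(a * b for a, b in zip(phi, f))


gamma = 0.5
# Two safe demonstrations that pursue different (unknown) rewards.
demo_a = discounted_feature_expectation([[[1.0, 0.0], [1.0, 0.0]]], gamma)
demo_b = discounted_feature_expectation([[[0.0, 1.0], [0.0, 1.0]]], gamma)

# Hidden constraint, used here only to verify safety; the learner does not see it.
phi, xi = [1.0, 2.0], 3.0

mixture = mix([demo_a, demo_b], [0.5, 0.5])
print(demo_a, demo_b, mixture)
print(cost(phi, demo_a) <= xi, cost(phi, demo_b) <= xi, cost(phi, mixture) <= xi)
```

Because the cost is linear, the cost of the mixture is the same weighted average of the demonstrations' costs, so it cannot exceed the threshold if every demonstration satisfies it.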

Paper Structure (62 sections, 25 theorems, 73 equations, 9 figures, 4 algorithms)

This paper contains 62 sections, 25 theorems, 73 equations, 9 figures, 4 algorithms.

Key Result

Proposition 1

There are CMDPs $\mathcal{C} = (S, A, P, \mu_0, \gamma, r, \{c_j\}_{j=1}^n, \{\xi_j\}_{j=1}^n )$ such that for any optimal policy $\pi^*$ in $\mathcal{C}$ and any reward function $r_{\text{IRL}}$ that could be returned by an IRL algorithm, the resulting MDP $(S, A, P, \mu_0, \gamma, r_{\text{IRL}})$

Figures (9)

  • Figure 1: CoCoRL can, e.g., learn safe driving behavior from diverse driving trajectories with different unknown reward functions (here, $r_1$: "turn left", $r_2$: "turn right"). It infers constraints $c_1, c_2, c_3$ describing desirable driving behavior from demonstrations without knowledge of the specific reward functions $r_1, r_2$. These inferred constraints make it possible to optimize for a new reward function $r_{\text{eval}}$ ("go straight"), ensuring safe driving behavior even in new situations where matching demonstrations are unavailable.
  • Figure 2: Experimental results in Gridworld environments. We consider three settings: (a) no constraint transfer, (b) transferring constraints to new goals in the same grid, and (c) transferring constraints to a new Gridworld with the same structure but different transition dynamics. For each setting, we measure the normalized policy return (higher is better), and the constraint violation (lower is better). The plots show mean and standard errors over 100 random seeds. CoCoRL consistently returns safe solutions, outperforming the IRL-based methods that generally perform worse and are unsafe. The IL baseline performs exactly the same as CoCoRL with no environment transfer (the lines overlap in plots (a) and (b)), but it produces unsafe solutions with transfer (c). A return greater than 1 indicates a solution that surpasses the best safe policy, implying a constraint violation.
  • Figure 3: Return and constraint violation in the highway-env intersection environment. All plots show mean and standard error over $5$ random seeds with a fixed set of evaluation rewards for each setting. In all settings, CoCoRL consistently returns safe policies, as indicated by the low constraint violation values. However, there are instances where CoCoRL falls back to providing a default safe solution when the policy optimizer fails to find a feasible solution within the safe set $\mathcal{S}$. The frequency of falling back to the default solution is shown by the bars in the constraint violation plots. In contrast, the IRL method often yields unsafe solutions. IL outperforms CoCoRL because we implement an idealized version with perfect imitation. However, for task transfer IL performs much worse than CoCoRL and for environment transfer it produces unsafe solutions.
  • Figure 4: Illustration of the guaranteed hull in 2D. The gray areas depict confidence sets $\mathcal{C}_1, \mathcal{C}_2, \mathcal{C}_3$, and the green area is the conservative safe set $\hat{\mathcal{S}}$ constructed using the guaranteed hull. For large confidence sets, the guaranteed hull can be empty (left); for small confidence sets, it approaches the safe set constructed using the exact feature expectations (right).
  • Figure 5: Illustration of safe set and "unsafe" set in 2D.
  • ...and 4 more figures
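
The 2D illustrations in Figures 4 and 5 suggest a simple membership test. The sketch below assumes, for illustration only, that the safe set is the convex hull of the demonstrations' feature expectations and that those expectations are known exactly (so it does not cover the confidence sets or the guaranteed hull from Figure 4). The function names and the example points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_safe_set(candidate: Point, demos: List[Point], tol: float = 1e-9) -> bool:
    """True if the candidate lies inside (or on the boundary of) the hull of the demos."""
    hull = convex_hull(demos)
    if len(hull) < 3:
        raise ValueError("need at least three non-collinear demonstrations")
    # Inside a counter-clockwise polygon means being on the left of every edge.
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], candidate) >= -tol
        for i in range(len(hull))
    )


# Hypothetical 2D feature expectations of four safe demonstrations.
demos = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.5, 0.5)]
print(in_safe_set((0.5, 0.5), demos))  # inside the hull
print(in_safe_set((2.0, 2.0), demos))  # outside the hull
```

In higher dimensions the same membership question is usually posed as a linear feasibility problem over the mixture weights. The cross-product test here is specific to the 2D case.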

Theorems & Definitions (41)

  • Proposition 1: IRL can be unsafe
  • Proposition 2
  • Lemma 1: The true safe set is convex
  • Theorem 1: Estimated safe set
  • Theorem 2: Inferred CMDP
  • Corollary 1: The safe set $\mathcal{S}$ is maximal
  • Theorem 3: Convergence, exact optimality
  • Theorem 4: Convergence, Boltzmann-rationality
  • Theorem 5: $\epsilon$-safety with estimated feature expectations
  • Proposition 2: IRL can be unsafe
  • ...and 31 more