
Environment Design for Inverse Reinforcement Learning

Thomas Kleine Buening, Victor Villin, Christos Dimitrakakis

TL;DR

The paper tackles the low sample-efficiency and poor robustness of inverse reinforcement learning (IRL) when demonstrations come from a single, fixed environment. It introduces Environment Design for IRL, a framework that adaptively selects informative environments in which to elicit demonstrations, formalized via a maximin Bayesian regret objective over an environment set $\mathcal{T}$. By extending Bayesian IRL (BIRL) and adversarial IRL (AIRL, a MaxEnt-based method) to multiple environments, the authors present ED-BIRL and ED-AIRL, with ED-AIRL building on AIRL-ME and multi-environment reward estimates. Experiments on discrete mazes and continuous-control tasks show that ED-BIRL and ED-AIRL recover nearly all relevant reward structure and are more robust to dynamics perturbations than fixed-environment IRL and domain randomisation, at the price of higher computational cost in the multi-environment setting.
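For orientation, a maximin Bayesian regret objective of this kind can be written as below. This is a sketch only: the value notation $V^{\pi}_{R,T}$ (expected discounted return of policy $\pi$ under reward $R$ and dynamics $T$) and the exact form of the regret are assumptions made for illustration and may differ from the paper's definitions.

```latex
\[
  \mathrm{BR}_{\mathbb{P}}(T, \pi)
    = \mathbb{E}_{R \sim \mathbb{P}}
      \Big[ \max_{\pi' \in \Pi} V^{\pi'}_{R,T} - V^{\pi}_{R,T} \Big],
  \qquad
  T^{\star} \in \operatorname*{arg\,max}_{T \in \mathcal{T}}
                \min_{\pi \in \Pi} \mathrm{BR}_{\mathbb{P}}(T, \pi).
\]
% For fixed \pi and T the value is linear in the reward, so
% \mathbb{E}_{R \sim \mathbb{P}}[V^{\pi}_{R,T}] = V^{\pi}_{\bar{R},T}
% with \bar{R} = \mathbb{E}_{\mathbb{P}}[R].
```

Under this assumed definition, the inner minimum over $\pi$ is attained by a policy that is optimal for the posterior mean reward $\bar{R}$ in $T$, which is why the posterior mean plays a central role in the key result listed further down.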

Abstract

Learning a reward function from demonstrations suffers from low sample-efficiency. Even with abundant data, current inverse reinforcement learning methods that focus on learning from a single environment can fail to handle slight changes in the environment dynamics. We tackle these challenges through adaptive environment design. In our framework, the learner repeatedly interacts with the expert, with the former selecting environments to identify the reward function as quickly as possible from the expert's demonstrations in said environments. This results in improvements in both sample-efficiency and robustness, as we show experimentally, for both exact and approximate inference.
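The environment-selection step described above can be sketched for small tabular problems as follows. This is a minimal illustration, not the authors' implementation: it assumes state-based rewards, dynamics stored as an array `T[s, a, s']`, a posterior represented by a handful of reward samples, and a Bayesian regret defined as the expected gap between the optimal value and the value of a candidate policy at a start state. All function names and the toy environments are hypothetical.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, iters=500):
    """Optimal state values and a greedy policy for state reward R under dynamics T[s, a, s']."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R[:, None] + gamma * (T @ V)  # Q has shape (num_states, num_actions)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def policy_value(T, R, policy, gamma=0.9):
    """Exact values of a deterministic policy, solving (I - gamma * P_pi) V = R."""
    S = T.shape[0]
    P = T[np.arange(S), policy]  # transition matrix induced by the policy
    return np.linalg.solve(np.eye(S) - gamma * P, R)

def bayesian_regret(T, reward_samples, policy, gamma=0.9, start=0):
    """Average gap between optimal and achieved start-state value over posterior samples."""
    gaps = [
        value_iteration(T, R, gamma)[0][start] - policy_value(T, R, policy, gamma)[start]
        for R in reward_samples
    ]
    return float(np.mean(gaps))

def select_environment(envs, reward_samples, gamma=0.9, start=0):
    """Pick the environment with the largest minimal Bayesian regret (maximin)."""
    mean_R = np.mean(reward_samples, axis=0)
    regrets = []
    for T in envs:
        # Values are linear in the reward, so the regret-minimising policy
        # is one that is optimal for the posterior mean reward.
        _, pi = value_iteration(T, mean_R, gamma)
        regrets.append(bayesian_regret(T, reward_samples, pi, gamma, start))
    return int(np.argmax(regrets)), regrets

def absorbing_env(next_states):
    """Hypothetical toy dynamics: 3 states, 2 actions; action a in state 0
    moves to next_states[a], and states 1 and 2 are absorbing."""
    T = np.zeros((3, 2, 3))
    for a, s_next in enumerate(next_states):
        T[0, a, s_next] = 1.0
    T[1, :, 1] = 1.0
    T[2, :, 2] = 1.0
    return T

# Two candidate environments and two reward hypotheses (goal in state 1 or state 2).
envs = [absorbing_env([1, 1]), absorbing_env([1, 2])]
reward_samples = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
best, regrets = select_environment(envs, reward_samples)
```

In this toy setup the first environment cannot distinguish the two reward hypotheses, since both actions lead to the same state, so its regret is zero and the second environment is selected. In a full loop, the learner would then observe the expert in the chosen environment and update the posterior before selecting again; that update is not shown here.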

Paper Structure

This paper contains 54 sections, 4 theorems, 12 equations, 12 figures, 3 tables, and 6 algorithms.

Key Result

lemma 1

If for some posterior $\mathbb{P}(\cdot \mid \mathcal{D})$ we have $\max_{T\in \mathcal{T}} \min_{\pi\in \Pi} \mathop{\mathrm{BR}}\nolimits_{\mathbb{P}}(T, \pi) = 0$, then the posterior mean $\bar{R} = \mathbb{E}_{\mathbb{P}} [R]$ is optimal for every $T \in \mathcal{T}$, i.e., $\bar{R}$ induces an

Figures (12)

  • Figure 1: The expert navigates to the closest of the three possible goal squares while avoiding lava in adaptively selected maze environments. For three consecutive rounds (a)-(c), we display the mazes chosen by ED-BIRL (Algorithm \ref{algorithm:ED-BIRL} in Section \ref{section:irl}) as well as the current reward estimate after observing an expert trajectory in the current and past mazes. By adaptively designing environments and combining the expert demonstrations, we can recover the locations of all goal squares and most lava squares. In contrast, from observations in a fixed environment, e.g., repeatedly observing the expert in maze (a), it would be impossible to recover all relevant aspects of the reward function, i.e., the locations of the goal squares, as the expert would only ever visit the nearest goal square. Observing the human expert in new and carefully curated environments can lead to a more precise and robust estimate of the unknown reward function.
  • Figure 2: The discrete maze task from Figure 1. The goal in the discrete maze environment is to reach one of the three green goal squares while avoiding lava squares. For each approach, we show the chosen mazes on the left and visualise the posterior mean reward after three rounds on the right. In (a), the expert always acts in the same, fixed maze. In (b), the maze is randomly generated by adding obstacles, i.e., gray squares, uniformly at random. The proposed ED-BIRL approach, which adaptively chooses maze layouts based on past reward estimates, is shown in (c). We use the same colour scale as in Figure 1, which ranges from black (0.0) to red (0.5) to white (1.0).
  • Figure 3: Examples of demo and test environments.
  • Figure 4: Average normalised performance on the demo set in the continuous maze as we increase the number of expert trajectories (averaged over 5 runs). The standard error is shown in shaded colour. Every 5 trajectories, $\texttt{ED-AIRL}$ chooses a new environment for the expert to act in. In contrast, $\texttt{AIRL}$ always observes the expert act in the same base environment.
  • Figure 5: On a randomly generated MDP task, we evaluate the robustness of reward estimates learned by ED-BIRL, Domain Randomisation, and Fixed Environment IRL, respectively.
  • ...and 7 more figures

Theorems & Definitions (6)

  • lemma 1
  • remark 1
  • proof: Proof of Lemma 1
  • lemma 2: Generalisability of Optimal Rewards
  • lemma 3: Generalisability of Optimal Rewards with Optimal Demonstrator
  • lemma 4: Lower Bound on Generalisability