
A Bayesian Solution To The Imitation Gap

Risto Vuorio, Mattie Fellows, Cong Lu, Clémence Grislain, Shimon Whiteson

TL;DR

The paper tackles the imitation gap that arises when the expert has access to privileged information not available to the imitator. It presents BIG, a fully Bayesian pipeline that first learns a posterior over rewards from expert demonstrations via contextual Bayesian IRL, then incorporates a cost-of-exploration (COE) prior to enable prudent test-time exploration, and finally computes a Bayes-optimal policy in a Bayes-adaptive MDP (BAMDP). Key ideas include contextual successor features to separate dynamics from rewards, a Laplace-approximated Bayesian IRL step, and a COE prior that rescales rewards within [r_min, r_max] while introducing uncertainty over exploration costs. Empirically, BIG outperforms standard imitation learning in imitation-gap scenarios, recovers reward structure in both simple and large contextual MDPs (CMDPs), and scales to high-dimensional observations, demonstrating the practical viability of Bayes-optimal exploration when an imitation gap is present.

Abstract

In many real-world settings, an agent must learn to act in environments where no reward signal can be specified, but a set of expert demonstrations is available. Imitation learning (IL) is a popular framework for learning policies from such demonstrations. However, in some cases, differences in observability between the expert and the agent can give rise to an imitation gap such that the expert's policy is not optimal for the agent and a naive application of IL can fail catastrophically. In particular, if the expert observes the Markov state and the agent does not, then the expert will not demonstrate the information-gathering behavior needed by the agent but not the expert. In this paper, we propose a Bayesian solution to the Imitation Gap (BIG), first using the expert demonstrations, together with a prior specifying the cost of exploratory behavior that is not demonstrated, to infer a posterior over rewards with Bayesian inverse reinforcement learning (IRL). BIG then uses the reward posterior to learn a Bayes-optimal policy. Our experiments show that BIG, unlike IL, allows the agent to explore at test time when presented with an imitation gap, whilst still learning to behave optimally using expert demonstrations when no such gap exists.

Paper Structure

This paper contains 30 sections, 3 theorems, 41 equations, 9 figures, 5 tables, 2 algorithms.

Key Result

Theorem 4.1

Define $\varsigma_0^2 \coloneqq \frac{\sigma_0^2}{\alpha}$. Using the Laplace approximation, the approximate posterior is $p(\omega\vert \mathcal{D}_\textrm{Expert})\approx \mathcal{N}(\omega\vert \omega^\star_\textrm{Laplace},\Sigma_\textrm{Laplace})$ where $\Sigma_\textrm{Laplace}=\nabla^2_\omega \dots$
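As a generic illustration of the Laplace approximation named in the theorem, the sketch below fits a Gaussian to a one-dimensional posterior by locating the posterior mode with Newton's method and then inverting the Hessian of the negative log posterior at that mode. The logistic likelihood, the zero-mean Gaussian prior, and every function name here are assumptions chosen for illustration; this is not the paper's reward model, nor its exact expression for $\Sigma_\textrm{Laplace}$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neg_log_posterior_derivs(w, xs, ys, prior_var):
    """Gradient and Hessian of the negative log posterior at w.

    Assumed toy model (not from the paper): y ~ Bernoulli(sigmoid(w * x)),
    with prior w ~ N(0, prior_var).
    """
    grad = w / prior_var
    hess = 1.0 / prior_var
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        grad += (p - y) * x
        hess += p * (1.0 - p) * x * x
    return grad, hess


def laplace_approximation_1d(xs, ys, prior_var=1.0, iters=50):
    """Return (w_map, var) such that p(w | data) is approximated by N(w_map, var)."""
    w = 0.0
    for _ in range(iters):
        grad, hess = neg_log_posterior_derivs(w, xs, ys, prior_var)
        step = grad / hess  # Newton step towards the posterior mode
        w -= step
        if abs(step) < 1e-12:
            break
    # The Laplace covariance is the inverse Hessian evaluated at the mode.
    _, hess = neg_log_posterior_derivs(w, xs, ys, prior_var)
    return w, 1.0 / hess


if __name__ == "__main__":
    w_map, var = laplace_approximation_1d([1.0, 2.0, -1.0, 0.5], [1, 1, 0, 1])
    print(f"mode={w_map:.4f}, variance={var:.4f}")
```

Because the Hessian adds the prior precision to a non-negative data term, the approximate posterior variance in this toy model can never exceed the prior variance.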

Figures (9)

  • Figure 1: A diagram of the Tiger-Treasure Problem MDP, a classic example of an imitation gap. The agent initially does not know which door the treasure or tiger is behind and must take listening actions to resolve its uncertainty.
  • Figure 2: Schematic of the Bayesian solution to the Imitation Gap (BIG). Prior information is shown in green, algorithms in pink, prior distributions in yellow, and outputs in blue.
  • Figure 3: Evaluation of BIG in the Tiger-Treasure environment. Success rate and time exploring (in steps) for policies learned with a uniform prior reward with different means are represented in yellow ($k^\star < 1$) and in green ($k^\star=1$), while the case with no prior is shown in red. Error bars indicate the standard error of the mean across 10 seeds. The symbol $+\infty$ indicates that, for some trials, the agent explores throughout the entire (infinite) episode.
  • Figure 4: A demonstration of the necessity for latent inference with BIG. On the left, we show the CMDP used in the experiments, with two possibilities for the context $\theta$. On the right, we show the ground truth returns of a DQN agent for trajectories of 100 steps in the CMDP during training. The shading shows the standard error of the mean for 8 seeds.
  • Figure 5: BIG successfully learns the optimal behavior in a challenging gridworld environment. On the left, we show the rewards learned by the contextual IRL. In the middle, we show the return (using the manually constructed reward) of policies trained with reward inferred with and without a reward prior and the manually constructed reward (ground truth). The shading shows the standard error of the mean for 8 random seeds. On the right, we show the final returns of policies trained using different values of $k^\star$ compared to not using the reward prior and using the ground truth reward.
  • ...and 4 more figures
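The listening actions described for Figure 1 can be read as Bayesian belief updates: each noisy observation shifts the agent's belief about which door hides the tiger. The sketch below is a minimal illustration in the style of the classic tiger problem. The listening accuracy of 0.85, the confidence threshold of 0.95, and the function names are assumptions made for this example and are not taken from the paper.

```python
def update_belief(belief_left, heard_left, accuracy=0.85):
    """Bayes update of P(tiger is behind the left door) after one listen.

    `accuracy` is the assumed probability that listening reports the true side.
    """
    like_left = accuracy if heard_left else 1.0 - accuracy
    like_right = 1.0 - accuracy if heard_left else accuracy
    numerator = like_left * belief_left
    return numerator / (numerator + like_right * (1.0 - belief_left))


def steps_until_confident(observations, threshold=0.95, accuracy=0.85):
    """Count listens until the belief is decisive in either direction.

    Returns (n_listens, belief); n_listens is None if the threshold is
    never reached within the given observations.
    """
    belief = 0.5  # uniform prior: the agent does not know where the tiger is
    for n, heard_left in enumerate(observations, start=1):
        belief = update_belief(belief, heard_left, accuracy)
        if belief >= threshold or belief <= 1.0 - threshold:
            return n, belief
    return None, belief


if __name__ == "__main__":
    n, belief = steps_until_confident([True, False, True, True])
    print(f"listens={n}, P(tiger left)={belief:.4f}")
```

An expert that observes the true state never needs these listens, which is why its demonstrations contain no such information-gathering behavior for the agent to imitate.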

Theorems & Definitions (6)

  • Theorem 4.1
  • proof
  • Lemma 4.1
  • proof
  • Theorem 4.1
  • proof