Efficient Imitation under Misspecification

Nicolas Espinosa-Dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy

TL;DR

This paper proves that, in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play, and it experimentally explores a variety of sources of misspecification.

Abstract

We consider the problem of imitation learning under misspecification: settings where the learner is fundamentally unable to replicate expert behavior everywhere. This is often true in practice due to differences in observation space and action space expressiveness (e.g. perceptual or morphological differences between robots and humans). Given the learner must make some mistakes in the misspecified setting, interaction with the environment is fundamentally required to figure out which mistakes are particularly costly and lead to compounding errors. However, given the computational cost and safety concerns inherent in interaction, we'd like to perform as little of it as possible while ensuring we've learned a strong policy. Accordingly, prior work has proposed a flavor of efficient inverse reinforcement learning algorithms that merely perform a computationally efficient local search procedure with strong guarantees in the realizable setting. We first prove that under a novel structural condition we term reward-agnostic policy completeness, these sorts of local-search based IRL algorithms are able to avoid compounding errors. We then consider the question of where we should perform local search in the first place, given the learner may not be able to "walk on a tightrope" as well as the expert in the misspecified setting. We prove that in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play. We then experimentally explore a variety of sources of misspecification and how offline data can be used to effectively broaden where we perform local search from.
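The idea of broadening where local search starts can be illustrated with a minimal sketch. It assumes reset states are drawn from a simple mixture of expert-demonstration states and offline-data states. The function name `sample_reset_state`, the parameter `mix_prob`, and the mixture rule itself are hypothetical illustrations, not the paper's implementation.

```python
import random


def sample_reset_state(expert_states, offline_states, mix_prob, rng):
    """Pick a state from which to start a local-search rollout.

    Hypothetical helper: with probability `mix_prob` the reset state is
    drawn from offline data (states a realizable policy can reach);
    otherwise it is drawn from the expert demonstrations.
    """
    if offline_states and rng.random() < mix_prob:
        return rng.choice(offline_states)
    return rng.choice(expert_states)


rng = random.Random(0)
expert_states = ["e0", "e1", "e2"]    # placeholder expert states
offline_states = ["o0", "o1"]         # placeholder offline states
state = sample_reset_state(expert_states, offline_states, 0.5, rng)
```

Setting `mix_prob` to 0 in this sketch recovers resets to expert states only, while larger values shift resets toward states covered by the offline data.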

Paper Structure

This paper contains 56 sections, 12 theorems, 60 equations, 8 figures, 3 tables, and 4 algorithms.

Key Result

Theorem 3.1

Assume $\Pi$ is finite, but $\pi_E \notin \Pi$. With probability at least $1 - \delta$, STILE learns a policy $\hat{\pi}$ such that:

Figures (8)

  • Figure 1: Maze construction under misspecification. The expert's trajectory is shown in green, and the learner's trajectory is shown in orange. The green circle represents the goal position that returns the maximum possible reward, while the purple circle represents the goal position that returns the maximum realizable reward.
  • Figure 3: Changing dynamics. By resetting to states from a realizable policy, GUITAR shows a speedup over traditional inverse RL methods on a problem where the misspecification is due to a dynamics mismatch. Standard errors are computed across 5 seeds.
  • Figure 4: We plot the coverage of the D4RL Antmaze-Large expert data, including the full dataset, short trajectories (episodes shorter than 500 steps), and long trajectories (episodes 500 steps or longer).
  • Figure 5: Resets to subsets of $\pi^{\star}$'s state distribution. We consider whether it is necessary, as the theory suggests, to reset to a distribution that covers $\pi^{\star}$'s state distribution. In this experiment, we reset to subsets of $\pi^{\star}$'s state distribution and compare their performance. The performance of $\texttt{GUITAR}(D_{\text{short}})$ matches the performance of $\texttt{FILTER}(D_{\text{full}})$, showing that full coverage of $\pi^{\star}$'s state distribution is not necessary in practice. Standard errors are computed across 10 seeds. During evaluation, agents sample a random action with probability $p_{\text{tremble}}$.
  • Figure 6: Environment without arbitrary reset access. Standard errors are computed across 5 seeds. Expert data is a partial trajectory (i.e. a subset of one full trajectory).
  • ...and 3 more figures
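The evaluation noise mentioned in the Figure 5 caption (a random action taken with probability $p_{\text{tremble}}$) can be sketched as follows. This is a minimal illustration assuming a finite list of actions for simplicity. The function name `tremble_action` and the discrete action list are hypothetical, and the paper's environments may use continuous actions.

```python
import random


def tremble_action(policy_action, action_space, p_tremble, rng):
    """Return a uniformly random action with probability `p_tremble`,
    otherwise return the action chosen by the policy."""
    if rng.random() < p_tremble:
        return rng.choice(action_space)
    return policy_action


rng = random.Random(0)
actions = ["left", "right", "up", "down"]  # placeholder action set
chosen = tremble_action("up", actions, 0.1, rng)
```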

Theorems & Definitions (21)

  • Theorem 3.1: Sample Complexity of STILE under Misspecification
  • Theorem 3.2: Lower Bound on Misspecified RL with Expert Feedback [jia2024agnostic]
  • Definition 4.1: Reward-Indexed Policy Completeness Error
  • Definition 4.2: Reward-Agnostic Policy Completeness Error
  • Theorem 4.3: Sample Complexity of GUITAR
  • Corollary 5.1: Benefit of Offline Data
  • Theorem B.1: Sample Complexity of STILE
  • proof
  • proof
  • proof
  • ...and 11 more