Feedback in Imitation Learning: The Three Regimes of Covariate Shift

Jonathan Spencer, Sanjiban Choudhury, Arun Venkatraman, Brian Ziebart, J. Andrew Bagnell

TL;DR

This work analyzes why imitation learning often exhibits a gap between hold-out error and in-situ performance, attributing the gap to feedback-driven covariate shift. It argues that, for a broad class of problems, especially those with action-history effects, this shift can be mitigated by leveraging a simulator and cached expert demonstrations rather than online expert queries. The authors introduce ALICE, a framework with multiple loss families (ALICE-Cov, Fail, ALICE-Cov-Fail) and forward-training variants that bound regret in a Goldilocks regime where density ratios between learner and expert distributions are finite. They also critique common IL benchmarks as too easy to expose covariate-shift phenomena and advocate for standardized, IL-centric benchmarks. Overall, the paper provides theory and algorithms showing how covariate shift can be addressed without expert queries, and it outlines practical directions for benchmarks and recoverability conditions.

Abstract

Imitation learning practitioners have often noted that conditioning policies on previous actions leads to a dramatic divergence between "held out" error and performance of the learner in situ. Interactive approaches can provably address this divergence but require repeated querying of a demonstrator. Recent work identifies this divergence as stemming from a "causal confound" in predicting the current action, and seeks to ablate causal aspects of current state using tools from causal inference. In this work, we argue instead that this divergence is simply another manifestation of covariate shift, exacerbated particularly by settings of feedback between decisions and input features. The learner often comes to rely on features that are strongly predictive of decisions, but are subject to strong covariate shift. Our work demonstrates a broad class of problems where this shift can be mitigated, both theoretically and practically, by taking advantage of a simulator but without any further querying of expert demonstration. We analyze existing benchmarks used to test imitation learning approaches and find that these benchmarks are realizable and simple and thus insufficient for capturing the harder regimes of error compounding seen in real-world decision-making problems. We find, in a surprising contrast with previous literature, but consistent with our theory, that naive behavioral cloning provides excellent results. We detail the need for new standardized benchmarks that capture the phenomena seen in robotics problems.

Paper Structure

This paper contains 22 sections, 7 theorems, 27 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose the on-policy training error is bounded, $\mathbb{E}_{s \sim \rho_{\pi}}\left[\ell^{\mathrm{cs}}(\hat{\pi}(s), \pi^*(s))\right] \leq \epsilon$, where $\ell^{\mathrm{cs}}$ is the 0-1 loss (or an upper bound on it). Then $J(\hat{\pi}) \leq J(\pi^*) + T^2 \epsilon$.
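To give a feel for where the quadratic dependence on $T$ can come from, here is a minimal sketch in Python. It is not taken from the paper: it assumes a toy setting in which costs are 0 or 1 per step, the learner errs with probability exactly `eps` at each step while it is still on the expert's path, and the first mistake is either never recovered from (`compounding_cost`) or costs a single step (`recoverable_cost`). Both function names are hypothetical.

```python
def compounding_cost(horizon: int, eps: float) -> float:
    """Expected extra cost when the first mistake is never recovered from.

    Toy assumption: after the first mistake at step t, the learner pays
    cost 1 on every remaining step, i.e. (horizon - t + 1) in total.
    """
    expected = 0.0
    on_track = 1.0  # probability that no mistake has happened yet
    for t in range(1, horizon + 1):
        first_mistake_now = on_track * eps
        expected += first_mistake_now * (horizon - t + 1)
        on_track *= 1.0 - eps
    return expected


def recoverable_cost(horizon: int, eps: float) -> float:
    """Expected extra cost when every mistake costs exactly one step."""
    return horizon * eps


if __name__ == "__main__":
    T, eps = 100, 0.001
    print("compounding :", compounding_cost(T, eps))
    print("recoverable :", recoverable_cost(T, eps))
    print("T^2 * eps   :", T * T * eps)
```

In this toy model the compounding cost always stays below $T^2 \epsilon$, since each term is at most $\epsilon T$ and there are $T$ terms, while the recoverable cost grows only as $T\epsilon$.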

Figures (4)

  • Figure 1: A common example of feedback-driven covariate shift in self-driving. At train time, the robot learns that the previous action (Brake) accurately predicts the current action almost all the time. At test time, when the learner mistakenly chooses to Brake, it continues to choose Brake, creating a bad feedback cycle that causes it to diverge from the expert.
  • Figure 2: Inherent feedback in sequential decision making tasks. Past action $a_{t-1}$ affects current action $a_t$, either indirectly (blue) via MDP dynamics or directly (green) via explicit conditioning.
  • Figure 3: Spectrum of feedback-driven covariate shift regimes. Consider the case of training a UAV to fly through a forest using demonstrations (blue). In the Easy regime, the demonstrator is realizable, while in the Goldilocks and Hard regimes, the learner (yellow) is confined to a more restrictive policy class. While model misspecification usually requires interactive demonstrations, in the Goldilocks regime, ALICE achieves $O(T\epsilon)$ without interactive queries.
  • Figure 4: Three different MDPs with varying recoverability regimes. For all MDPs, $C(s_1)=0$ and $C(s) = 1$ for all $s \neq s_1$. The deterministic expert policy is therefore $\pi^*(s_1)=a_1$ and $\pi^*(s)=a_2$ for all $s \neq s_1$. Even with one-step recoverability, BC can still result in $O(T^2\epsilon)$ error. For $>1$-step recoverability, even Fail slides to $O(T^2\epsilon)$, while DAgger can recover in $k$ steps, leading to $O(kT\epsilon)$. For unrecoverable problems, all algorithms can go up to $O(T^2\epsilon)$. Hence recoverability dictates the lower bound on how well we can do in the model-misspecified regime.
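
The feedback cycle described in Figure 1 can be illustrated with a small Python sketch. Everything here is a hypothetical toy, not the paper's experiment: the expert trajectory is hand-written, and `copycat` stands in for a learner that has latched onto the previous action as its only feature. The sketch compares held-out accuracy, where the previous action comes from the expert, with rollout accuracy, where the previous action is the learner's own.

```python
from typing import List

GO, BRAKE = 0, 1


def expert_trajectory() -> List[int]:
    # Hypothetical demonstration: long stretches of GO with short BRAKE runs.
    return ([GO] * 20 + [BRAKE] * 5) * 2


def copycat(prev_action: int) -> int:
    # Toy learner: simply repeats whatever the previous action was.
    return prev_action


def held_out_accuracy(expert: List[int]) -> float:
    # Previous action is read from the expert's own trajectory.
    steps = range(1, len(expert))
    hits = sum(copycat(expert[t - 1]) == expert[t] for t in steps)
    return hits / len(steps)


def rollout_accuracy(expert: List[int], first_action: int) -> float:
    # Previous action is the learner's own output from the step before.
    prev, hits = first_action, 0
    for t in range(1, len(expert)):
        action = copycat(prev)
        hits += int(action == expert[t])
        prev = action
    return hits / (len(expert) - 1)


if __name__ == "__main__":
    demo = expert_trajectory()
    print("held-out accuracy:", held_out_accuracy(demo))
    print("rollout accuracy after one mistaken BRAKE:",
          rollout_accuracy(demo, first_action=BRAKE))
```

On this toy trajectory the copycat learner is wrong only at the three points where the expert switches action, so its held-out accuracy is high. Once it is started from a single mistaken BRAKE, it keeps braking for the rest of the rollout and agrees with the expert only during the expert's own BRAKE runs.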

Theorems & Definitions (18)

  • Theorem 1: Theorem 2.1 in Ross et al. (2011)
  • Theorem 2: Theorem 2.2 in Ross et al. (2011)
  • Theorem 3: BC in Goldilocks regime
  • proof
  • Theorem 4: ALICE-Cov
  • proof
  • Corollary 5.1
  • proof
  • Definition 5.1: One-step Recoverability
  • Theorem 5: Fail
  • ...and 8 more