Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning

Dylan J. Foster; Adam Block; Dipendra Misra

TL;DR

This work investigates horizon dependence in imitation learning (IL), challenging the belief that online IL is always substantially more sample-efficient than offline behavior cloning (BC). By analyzing log-loss behavior cloning (LogLossBC) under horizon-normalized rewards, the authors establish horizon-independent offline sample complexity for deterministic experts and variance-dependent bounds for stochastic experts, revealing a more nuanced offline-online gap. They show that online IL helps mainly in specific policy-class settings (e.g., no parameter sharing) and identify mechanisms (representational gains, value-based feedback, and exploration) through which online access can improve performance. Experiments on RL tasks and autoregressive language modeling complement the theory, showing regret that is largely insensitive to horizon in practice, and motivate further study of horizon-free IL with general function classes.

Abstract

Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations, and has been widely applied to robotics, autonomous driving, and autoregressive text generation. The simplest approach to IL, behavior cloning (BC), is thought to incur sample complexity with unfavorable quadratic dependence on the problem horizon, motivating a variety of different online algorithms that attain improved linear horizon dependence under stronger assumptions on the data and the learner's access to the expert. We revisit the apparent gap between offline and online IL from a learning-theoretic perspective, with a focus on the realizable/well-specified setting with general policy classes up to and including deep neural networks. Through a new analysis of behavior cloning with the logarithmic loss, we show that it is possible to achieve horizon-independent sample complexity in offline IL whenever (i) the range of the cumulative payoffs is controlled, and (ii) an appropriate notion of supervised learning complexity for the policy class is controlled. Specializing our results to deterministic, stationary policies, we show that the gap between offline and online IL is smaller than previously thought: (i) it is possible to achieve linear dependence on horizon in offline IL under dense rewards (matching what was previously only known to be achievable in online IL); and (ii) without further assumptions on the policy class, online IL cannot improve over offline IL with the logarithmic loss, even in benign MDPs. We complement our theoretical results with experiments on standard RL tasks and autoregressive language generation to validate the practical relevance of our findings.
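To make the object of study concrete, here is a minimal sketch of behavior cloning with the logarithmic loss over a finite policy class: the learner returns the policy that maximizes the total log-likelihood of the expert's state-action pairs. The function name `log_loss_bc`, the toy policies `pi_a` and `pi_b`, and the demonstration data are hypothetical and chosen for illustration only. The sketch also assumes stationary policies and a finite class, which is narrower than the general policy classes the abstract refers to.

```python
import math

def log_loss_bc(policy_class, trajectories):
    """Pick the policy with the highest log-likelihood on expert (state, action) pairs."""
    def log_likelihood(policy):
        total = 0.0
        for trajectory in trajectories:
            for state, action in trajectory:
                prob = policy(state).get(action, 0.0)
                if prob <= 0.0:
                    return -math.inf  # policy rules out an observed expert action
                total += math.log(prob)
        return total
    return max(policy_class, key=log_likelihood)

# Hypothetical toy problem: two states (0, 1), two actions ("L", "R"), horizon 3.
def pi_a(state):
    # Mostly plays "L" in state 0 and "R" in state 1.
    return {"L": 0.9, "R": 0.1} if state == 0 else {"L": 0.1, "R": 0.9}

def pi_b(state):
    # Uniformly random in every state.
    return {"L": 0.5, "R": 0.5}

# Two made-up expert trajectories of (state, action) pairs.
expert_data = [
    [(0, "L"), (1, "R"), (0, "L")],
    [(1, "R"), (1, "R"), (0, "L")],
]

best = log_loss_bc([pi_a, pi_b], expert_data)
print(best.__name__)  # prints pi_a
```

With a neural policy class, the exhaustive `max` would be replaced by gradient-based minimization of the same negative log-likelihood. The sketch only illustrates the objective, not the paper's analysis or experimental setup.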

Paper Structure (93 sections, 44 theorems, 222 equations, 6 figures, 2 tables)

Introduction
Background: Offline and Online Imitation Learning
Markov decision processes
Reward normalization
Offline Imitation Learning: Behavior Cloning
Behavior cloning
Online Imitation Learning and Recoverability
Contributions
Toward a learning-theoretic understanding of imitation learning
Experiments
Paper Organization
Notation
Horizon-Independent Analysis of Log-Loss Behavior Cloning
Log-Loss Behavior Cloning and Supervised Learning Guarantees
Horizon-Independent Analysis of LogLossBC for Deterministic Experts
...and 78 more sections

Key Result

proposition 1

For any (potentially stochastic) expert $\pi^{\star}\in\Pi$, the LogLossBC algorithm in eq:log_loss_bc ensures that with probability at least $1-\delta$,

Figures (6)

Figure 1: Suboptimality of a policy learned with log-loss behavior cloning (LogLossBC) as a function of the number of expert trajectories, for varying values of horizon $H$. In each environment, an imitator is trained according to LogLossBC and the regret with respect to the expert is reported, with reward normalized to be horizon-independent. (a) Continuous control with MuJoCo environment Walker2d-v4. (b) Discrete control with Atari environment BeamRiderNoFrameskip-v4. For both environments, we find that the regret is independent of horizon (or in the case of Atari, slightly improving with horizon), as predicted by our theoretical results. Full experimental details are provided in the Experiments section.
Figure 2: Dependence of expected regret on the horizon for multiple choices of the number of imitator trajectories $n$. (a) Continuous control environment Walker2d-v4. (b) Discrete Atari environment BeamRiderNoFrameskip-v4. For both environments, increasing the horizon does not lead to a significant increase in regret, as predicted by our theory.
Figure 3: (a) Relationship between the number of expert trajectories and expected regret for the Dyck environment, for multiple choices of horizon $H$. The expert is trained to produce valid Dyck words of length $H$, and the imitator's ability to generate a valid word is evaluated. We find that regret increases as a function of $H$. (b) Logarithm of the product of weight matrix norms for the expert policy network as a function of $H$, for the Dyck and Car environments. The log-product-norm acts as a proxy for the complexity of the class $\Pi$; we rescale so that the log-product-norm at $H=10$ is $1.0$ for both domains. For Dyck, we find that as $H$ increases, the complexity of $\Pi$ required to represent the expert policy (as measured by the log-product-norm) also increases, explaining the increasing regret in (a). However, the gain in log-product-norm for the Car domain is much lower, which is in line with the fact that the regret for the Car domain exhibits only mild scaling with horizon.
Figure 4: Dependence of expected regret on the number of expert trajectories for the Car environment under varying values of horizon $H$, for log-loss (a) and mean-squared loss (b). The expert policy network is trained via behavior cloning on a set of $2\times 10^4$ episodes generated by an optimal policy. We use LogLossBC to train the imitator policy for varying values of the horizon $H$ and number of trajectories $n$. For both losses, we find that the expected regret goes down as the number of expert trajectories increases, but degrades slightly as a function of $H$.
Figure 5: Dependence of expected regret on the number of expert trajectories for the continuous control environment Walker2d-v4 under varying choices of horizon $H$. (Left) Behavior cloning with logarithmic loss (LogLossBC); (Right) Behavior cloning with mean squared error (MSE) loss. Both losses lead to similar performance for this environment, possibly due to the Gaussian policy parameterization.
...and 1 more figures

Theorems & Definitions (82)

definition 1: Recoverability parameter
proposition 1: Supervised learning guarantee for LogLossBC (special case of \ref{thm:bc_generalization})
theorem 1: Horizon-independent regret decomposition (deterministic case)
corollary 1: Regret of LogLossBC (deterministic case)
theorem 2: Lower bound for deterministic experts
proposition 2: Special case of \ref{prop:dagger_finite}
remark 1: Known dynamics and inverse RL
theorem 3: Horizon-independent regret decomposition
corollary 2: Regret of LogLossBC
proposition 3
...and 72 more
