Stable-BC: Controlling Covariate Shift with Stable Behavior Cloning

Shaunak A. Mehta, Yusuf Umut Ciftci, Balamurugan Ramachandran, Somil Bansal, Dylan P. Losey

TL;DR

This work tackles covariate shift in behavior cloning by formulating the problem with error dynamics between current and demonstrated trajectories. By linearizing the error dynamics to obtain $\dot z = A z$, it derives local stability conditions and introduces Stable-BC, which augments the standard BC loss with a stability term to encourage convergence toward expert behaviors. In model-based settings, the full $A$ matrix is used and stability is enforced via an eigenvalue penalty; in model-free settings, bounded stability is achieved by controlling $A_1$ and minimizing $\|A_2\|$, yielding a data-efficient approach. Empirical results across interactive driving, nonlinear quadrotor navigation, visual perception tasks, and a real air hockey experiment show that Stable-BC improves robustness to covariate shift and can reduce the required demonstration data while producing smoother, more reliable policies.
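
To make the eigenvalue penalty concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed, hand-picked $2 \times 2$ matrix $A$ (in Stable-BC the matrix would depend on the learned policy), and the helper names `eigenvalues_2x2`, `stability_penalty`, and `simulate_error` are hypothetical. The penalty sums the positive real parts of the eigenvalues of $A$, and a forward-Euler rollout of $\dot z = A z$ shows the error shrinking when the penalty is zero and growing when it is not.

```python
import cmath


def eigenvalues_2x2(A):
    """Closed-form eigenvalues of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return ((trace + disc) / 2.0, (trace - disc) / 2.0)


def stability_penalty(A):
    """Sum of positive real parts of the eigenvalues of A.

    The value is zero when no eigenvalue has a positive real part.
    This is an assumed form of the penalty, chosen for illustration.
    """
    return sum(max(0.0, lam.real) for lam in eigenvalues_2x2(A))


def simulate_error(A, z0, dt=0.01, steps=1000):
    """Forward-Euler rollout of z_dot = A z; returns the final error norm."""
    (a, b), (c, d) = A
    z1, z2 = z0
    for _ in range(steps):
        z1, z2 = z1 + dt * (a * z1 + b * z2), z2 + dt * (c * z1 + d * z2)
    return (z1 * z1 + z2 * z2) ** 0.5


# Hypothetical example matrices (not taken from the paper).
A_stable = [[-1.0, 2.0], [0.0, -0.5]]    # eigenvalues -1.0 and -0.5
A_unstable = [[0.3, 1.0], [0.0, -1.0]]   # eigenvalues 0.3 and -1.0

if __name__ == "__main__":
    for name, A in (("stable", A_stable), ("unstable", A_unstable)):
        print(name, "penalty:", stability_penalty(A),
              "final error norm:", simulate_error(A, (1.0, 1.0)))
```

In a training loop, a term of this kind would be added to the behavior cloning loss with some weight, so that gradient descent pushes the eigenvalues of $A$ into the left half-plane while the policy still matches the expert's actions.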

Abstract

Behavior cloning is a common imitation learning paradigm. Under behavior cloning the robot collects expert demonstrations, and then trains a policy to match the actions taken by the expert. This works well when the robot learner visits states where the expert has already demonstrated the correct action; but inevitably the robot will also encounter new states outside of its training dataset. If the robot learner takes the wrong action at these new states it could move farther from the training data, which in turn leads to increasingly incorrect actions and compounding errors. Existing works try to address this fundamental challenge by augmenting or enhancing the training data. By contrast, in our paper we develop the control theoretic properties of behavior cloned policies. Specifically, we consider the error dynamics between the system's current state and the states in the expert dataset. From the error dynamics we derive model-based and model-free conditions for stability: under these conditions the robot shapes its policy so that its current behavior converges towards example behaviors in the expert dataset. In practice, this results in Stable-BC, an easy to implement extension of standard behavior cloning that is provably robust to covariate shift. We demonstrate the effectiveness of our algorithm in simulations with interactive, nonlinear, and visual environments. We also conduct experiments where a robot arm uses Stable-BC to play air hockey. See our website here: https://collab.me.vt.edu/Stable-BC/
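
The stability conditions mentioned above rest on a standard fact about linear systems, sketched here under assumptions. The symbol $z$ denotes the error between the current state and a state in the expert dataset, $A_\theta$ denotes the linearized error dynamics under a policy with parameters $\theta$, and the weight $\lambda$, the block partition of $A$, and the exact form of the penalized objective are illustrative assumptions rather than the paper's stated equations.

```latex
\begin{align}
  % Linearized error dynamics and its closed-form solution
  \dot{z} &= A z, & z(t) &= e^{A t} z(0) \\
  % Local stability: all eigenvalues in the open left half-plane
  \operatorname{Re}\,\sigma_i(A) &< 0 \;\; \forall i
    & &\Longrightarrow \;\; \lim_{t \to \infty} z(t) = 0 \\
  % Assumed model-based objective: BC loss plus an eigenvalue penalty
  \mathcal{L}(\theta) &= \mathcal{L}_{\mathrm{BC}}(\theta)
    + \lambda \sum_i \max\bigl(0, \operatorname{Re}\,\sigma_i(A_\theta)\bigr) \\
  % Assumed model-free view: only the robot-state rows of A are available
  \dot{z}_x &= A_1 z_x + A_2 z_y
\end{align}
```

Under this reading, a stable $A_1$ together with a small $\|A_2\|$ keeps the robot-state error $z_x$ bounded whenever the environment-state error $z_y$ is bounded, which matches the bounded-stability claim for the model-free setting.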

Paper Structure (13 sections, 12 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: Robot playing air hockey by behavior cloning demonstrations $\mathcal{D}$. The robot can successfully hit the puck when it moves at an angle and velocity observed during training ($\in \mathcal{D}$). However, when the puck moves at new angles or velocities ($\notin \mathcal{D}$), standard behavior cloning (BC) misses the puck entirely because of covariate shift. To address this problem we introduce Stable-BC, a variant of BC that encourages the system state to evolve similarly to the expert's demonstrated behaviors.
  • Figure 2: Simulation results from interactive driving. (Left) An example rollout using BC and Stable-BC. With BC the autonomous car gets stuck in the middle of the intersection. By contrast, when using Stable-BC the autonomous car lets the human pass and then crosses afterwards, resulting in a lower cost. (Right) Average cost over $100$ trials as a function of the number of expert demonstrations. In the left column the testing environment matches the training environment. In the middle column the human agent ignores the autonomous car, and in the right column the autonomous car starts from initial states outside of its training distribution. Shaded regions show SEM. Ideal cost is the best-case scenario where the autonomous car's learned policy exactly matches the policy of the human teacher. In the bottom row we plot Stable-BC (solid orange) and CCIL + Stable-BC (dashed orange).
  • Figure 3: Simulation results for nonlinear quadrotor navigation. (Left) An example trajectory of the quadrotor flying around the $3$D obstacles to reach its goal position. (Right) Average success rate of the quadrotor. We trained the system end-to-end $10$ separate times, and then performed $100$ test rollouts with each trained model. Shaded regions show SEM.
  • Figure 4: Simulation results for visual observations. (Left) The robot is trying to reach a goal. At each timestep the robot observes image $y$ where the goal position is marked by a white pixel; here we show an example of one of these images. The goal position and robot position are randomly sampled at the start of each new interaction. (Right) Average distance between the goal and the robot's final position over $25$ trials. Shaded regions show SEM.
  • Figure 5: Results for the air hockey experiment in Section \ref{sec:experiments}. (Left) Participants teleoperated a $7$ DoF robot arm to hit the puck. We collected their demonstration data offline, and then used this data to train BC and Stable-BC policies. (Center) We measured the number of successful hits with different amounts of training data. Ideally, a robust robot policy will repeatedly hit the puck, even when that puck travels with previously unseen angles and velocities. Both BC and Stable-BC eventually converged to equivalent performance, but Stable-BC reached that performance with a smaller amount of training data. (Right) To qualitatively assess the learned behavior, we also measured the number of direction changes per successful hit. Stable-BC produced policies that were smoother and more consistent, with fewer direction changes than BC. Error bars show SEM and $*$ denotes statistical significance ($p < 0.05$).