Towards Improving Learning from Demonstration Algorithms via MCMC Methods

Carl Qi, Edward Sun, Harry Zhang

TL;DR

This work reframes learning from demonstrations as an energy-based implicit policy problem, training $E_\theta(s,a)$ with InfoNCE and sampling-based inference to overcome the limitations of traditional BC. Through gradient-based trajectory optimization, it generates expert-like demonstrations in a differentiable dough-manipulation simulator, then compares gradient-free and Langevin MCMC inference for the implicit policy. Results show that implicit BC, particularly with Langevin dynamics, outperforms explicit BC and a soft-actor-critic baseline on contact-rich, deformable-object tasks, and generalizes to unseen configurations. The findings highlight the value of explicit probabilistic modeling and MCMC-based sampling for robust, on-policy learning from demonstrations in high-dimensional, multimodal action spaces.
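As a rough illustration of the InfoNCE-style objective mentioned above, the sketch below computes the loss for a single demonstration pair from scalar energies. It assumes the common implicit BC form, in which the expert action is the positive and a set of sampled counter-example actions are the negatives; the function name `infonce_loss` and its arguments are hypothetical, and the paper's exact formulation may differ.

```python
import math


def infonce_loss(pos_energy, neg_energies):
    """InfoNCE-style loss for one (state, expert action) pair.

    pos_energy:   E_theta(s, a_expert), a float (hypothetical input).
    neg_energies: E_theta(s, a_neg_j) for sampled negative actions.

    Returns -log( exp(-E_pos) / (exp(-E_pos) + sum_j exp(-E_neg_j)) ).
    """
    # Lower energy should mean higher likelihood, so logits are negated energies.
    logits = [-pos_energy] + [-e for e in neg_energies]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


if __name__ == "__main__":
    # The loss shrinks as the expert action gets lower energy than the negatives.
    print(infonce_loss(0.0, [0.0, 0.0, 0.0]))
    print(infonce_loss(-5.0, [5.0, 5.0]))
```

When every energy is equal, the loss reduces to the log of the number of candidates, which is a handy sanity check for a training loop.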

Abstract

Behavioral cloning, or more broadly, learning from demonstrations (LfD), is a promising direction for robot policy learning in complex scenarios. Although straightforward to implement and data-efficient, behavioral cloning has its own drawbacks, which limit its efficacy in real robot setups. In this work, we take one step towards improving learning from demonstration algorithms by leveraging implicit energy-based policy models. Results suggest that in selected complex robot policy learning scenarios, treating supervised policy learning with an implicit model generally performs better, on average, than commonly used neural network-based explicit models, especially when approximating potentially discontinuous and multimodal functions.

Paper Structure (17 sections, 11 equations, 4 figures, 1 algorithm)

Figures (4)

  • Figure 1: We perform trajectory optimization to obtain expert demonstrations. We first compute the loss between a target state and the states in our trajectory. We then back-propagate the gradient from the target shape through a differentiable simulator to get the updated actions.
  • Figure 2: Baseline method: Implicit Behavioral Cloning. The energy-based model over states and actions is trained via an InfoNCE-style loss function. Inference is performed by an MCMC sampling-based optimization procedure.
  • Figure 3: Normalized performance on 10 held-out configurations.
  • Figure 4: Performance of implicit BC with Langevin Dynamics over all configurations in the demonstration data. The policy is robust to unseen configurations (triangles).
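
The Langevin-dynamics inference referenced in Figures 2 and 4 can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's implementation: the energy `((a - s)^2 - 1)^2` is a hypothetical stand-in for a learned $E_\theta(s,a)$ with two equally good actions per state, and the step size is held fixed, whereas practical implementations often anneal it.

```python
import math
import random


def toy_energy_grad(state, action):
    """Gradient w.r.t. the action of the toy energy E(s, a) = ((a - s)^2 - 1)^2.

    This hypothetical energy has two minima, at a = s - 1 and a = s + 1,
    mimicking a multimodal action distribution.
    """
    d = action - state
    return 4.0 * d * (d * d - 1.0)


def langevin_sample(grad_fn, state, init_action, n_steps=500,
                    step_size=0.01, noise_scale=1.0, rng=None):
    """Run unadjusted Langevin dynamics on the action for a fixed state.

    Update: a <- a - step_size * dE/da + noise_scale * sqrt(2 * step_size) * xi,
    with xi drawn from a standard normal. noise_scale=0 gives plain gradient descent.
    """
    rng = rng or random.Random(0)
    a = init_action
    for _ in range(n_steps):
        xi = rng.gauss(0.0, 1.0)
        a = a - step_size * grad_fn(state, a) + noise_scale * math.sqrt(2.0 * step_size) * xi
    return a


if __name__ == "__main__":
    # Different initializations settle into different modes of the energy.
    print(langevin_sample(toy_energy_grad, 0.0, 0.5, noise_scale=0.0))
    print(langevin_sample(toy_energy_grad, 0.0, -0.5, noise_scale=0.0))
    # With noise, the chain explores around the low-energy region.
    print(langevin_sample(toy_energy_grad, 0.0, 0.5, rng=random.Random(1)))
```

Starting chains from several initial actions and keeping the lowest-energy result is one simple way to pick an action from a multimodal energy landscape; an explicit regressor trained with a mean-squared error would instead tend to average the two modes.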