Watch and Match: Supercharging Imitation with Regularized Optimal Transport

Siddhant Haldar; Vaibhav Mathur; Denis Yarats; Lerrel Pinto

TL;DR

Regularized Optimal Transport (ROT) addresses the sample inefficiency of inverse reinforcement learning (IRL) in imitation learning by adaptively combining offline behavior cloning (BC) with online optimal-transport (OT) trajectory matching. The method uses a two-phase protocol: BC pretraining, followed by online finetuning with a BC-regularized IRL objective, where soft Q-filtering adaptively sets the weight of the BC regularizer to stabilize training. Across 20 simulated visual control tasks, ROT reaches 90% of expert performance an average of $7.8\times$ faster than prior state-of-the-art methods; on 14 real-world robotic manipulation tasks, it achieves an average success rate of $90.1\%$ with a single demonstration and an hour of online training. The approach offers a practical framework for imitation in high-dimensional visual domains and on real robots, and leaves suboptimal or multimodal demonstrations and richer sensing modalities to future work.
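The adaptive BC weighting described above can be illustrated with a small sketch. It assumes that the soft Q-filter weight is the fraction of sampled states where the critic scores the pretrained BC policy's action above the current policy's action, and that this weight trades off the critic term against a BC error term in the actor loss. The function names, the scalar states and actions, and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Sequence

QFunction = Callable[[float, float], float]
Policy = Callable[[float], float]


def soft_q_filter_weight(
    states: Sequence[float],
    q_fn: QFunction,
    bc_policy: Policy,
    current_policy: Policy,
) -> float:
    """Fraction of states where the critic prefers the BC action.

    Assumption: a high value means the current policy still lags behind
    the BC policy, so the BC regularizer should be weighted more strongly.
    """
    if not states:
        raise ValueError("need at least one state")
    wins = sum(
        1 for s in states
        if q_fn(s, bc_policy(s)) > q_fn(s, current_policy(s))
    )
    return wins / len(states)


def regularized_actor_loss(q_value: float, bc_error: float, lam: float) -> float:
    """Hypothetical per-sample actor loss: (1 - lam) * (-Q) + lam * BC error."""
    return -(1.0 - lam) * q_value + lam * bc_error


# Toy problem (all hypothetical): the best action equals the state.
toy_q = lambda s, a: -(a - s) ** 2
toy_bc = lambda s: s                                    # BC policy is optimal here
toy_current = lambda s: s if s < 1.5 else s + 1.0       # current policy errs on later states

lam = soft_q_filter_weight([0.0, 1.0, 2.0, 3.0], toy_q, toy_bc, toy_current)
loss = regularized_actor_loss(q_value=2.0, bc_error=0.5, lam=lam)
```

In this toy setting the BC action wins on two of the four states, so the weight is 0.5. As the current policy catches up with the BC policy under the critic, the weight shrinks and the critic term dominates the loss.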

Abstract

Imitation learning holds tremendous promise in learning policies efficiently for complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where, given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal-transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of $7.8\times$ faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
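To make the idea of trajectory-matching rewards concrete, the sketch below computes per-step rewards from an entropy-regularized optimal transport plan between an agent trajectory and an expert trajectory. It assumes uniform marginals, a cosine cost between feature vectors, and plain Sinkhorn iterations; the function names, the regularization strength `eps`, and the toy trajectories are illustrative assumptions rather than the paper's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_cost(x: Vector, y: Vector) -> float:
    """1 - cosine similarity, with a small constant to avoid division by zero."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny + 1e-8)


def sinkhorn_plan(cost: List[List[float]], eps: float = 0.1,
                  iters: int = 200) -> List[List[float]]:
    """Entropy-regularized transport plan with uniform marginals."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_rewards(agent: Sequence[Vector], expert: Sequence[Vector],
               eps: float = 0.1) -> List[float]:
    """Reward for agent step i: negative transport cost assigned to that step."""
    cost = [[cosine_cost(x, y) for y in expert] for x in agent]
    plan = sinkhorn_plan(cost, eps)
    return [-sum(p * c for p, c in zip(plan[i], cost[i]))
            for i in range(len(agent))]


# Toy feature trajectories (hypothetical).
expert_traj = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
close_traj = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
far_traj = [[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]]

close_rewards = ot_rewards(close_traj, expert_traj)
far_rewards = ot_rewards(far_traj, expert_traj)
```

A trajectory that visits the same features as the expert receives rewards near zero, while one containing a step the expert never takes is penalized on that step. These rewards could then be maximized by any off-the-shelf actor-critic learner.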

Paper Structure (49 sections, 10 equations, 15 figures, 3 tables, 1 algorithm)

Introduction
Background
Imitation Learning with Optimal Transport (OT)
Actor-Critic based reward maximization
Challenges in Online Finetuning from a Pretrained Policy
Regularized Optimal Transport
Phase 1: BC Pretraining
Phase 2: Online Finetuning with IRL
Finetuning with Regularization
Adaptive Regularization with Soft Q-filtering
Considerations for image-based observations
Experiments
Simulated tasks
Robot tasks
Primary baselines
...and 34 more sections

Figures (15)

Figure 1: (Top) Regularized Optimal Transport (ROT) is a new imitation learning algorithm that adaptively combines offline behavior cloning with online trajectory-matching based rewards. This enables significantly faster imitation across a variety of simulated and real robotics tasks, while being compatible with high-dimensional visual observation. (Bottom) On our xArm robot, ROT can learn visual policies with only a single human demonstration and under an hour of online training.
Figure 2: (a) Given a single demonstration to avoid the grey obstacle and reach the goal location, BC is unable to solve the task. (b) Finetuning from this BC policy with OT-based reward also fails to solve the task. (c) ROT, which adaptively regularizes OT-based IRL with BC, successfully solves the task. (d) Even when the ROT agent is initialized randomly, it is able to solve the task.
Figure 3: Pixel-based continuous control learning on 9 selected environments. The shaded region represents $\pm1$ standard deviation across 5 seeds. We notice that ROT is significantly more sample efficient than prior work.
Figure 4: (Top) ROT is evaluated on a set of 14 robotic manipulation tasks. (Bottom) The success rate for each task is computed by running 20 trajectories from varying initial conditions on the robot.
Figure 5: Effect of various BC regularization schemes compared with our adaptive soft-Q filtering regularization.
...and 10 more figures
