RoboCLIP: One Demonstration is Enough to Learn Robot Policies

Sumedh A Sontakke; Jesse Zhang; Sébastien M. R. Arnold; Karl Pertsch; Erdem Bıyık; Dorsa Sadigh; Chelsea Finn; Laurent Itti

TL;DR

RoboCLIP is an online imitation learning method that uses pretrained video-and-language models (VLMs) to generate rewards from a single video demonstration or textual task description, removing the need for large expert datasets or manual reward engineering. It embeds both the agent’s episode and the task specification in the VLM's shared latent space and uses their similarity as the reward, which lets online RL (PPO) learn robust policies on Metaworld and Franka Kitchen, reaching 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks. The approach supports language-conditioned rewards, in-domain and out-of-domain video demonstrations, and multimodal task specifications, and shows that a single demonstration can yield competitive or superior results, optionally followed by finetuning on true task rewards. The findings highlight the practical potential of large VLMs for reward generation, enabling flexible task specification and reducing annotation burden, while acknowledging potential biases and stability challenges in real-world deployments.

Abstract

Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement) in the form of a video demonstration or a textual description of the task to generate rewards without manual reward function design. Additionally, RoboCLIP can also utilize out-of-domain demonstrations, like videos of humans solving the task for reward generation, circumventing the need to have the same demonstration and deployment domains. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, doing so using only one video/text demonstration.
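The reward mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are plain NumPy vectors standing in for the outputs of a pretrained VLM, the function names `similarity_reward` and `episode_rewards` are hypothetical, and both the use of cosine similarity and the delivery of the reward only at the final timestep are assumptions made for this example.

```python
import numpy as np


def similarity_reward(z_v: np.ndarray, z_d: np.ndarray) -> float:
    """Cosine similarity between an episode embedding z_v and a task embedding z_d.

    Assumption: cosine similarity is used here as one possible similarity score.
    """
    denom = np.linalg.norm(z_v) * np.linalg.norm(z_d)
    if denom == 0.0:
        # Degenerate embedding: no usable signal, so return a neutral reward.
        return 0.0
    return float(np.dot(z_v, z_d) / denom)


def episode_rewards(z_v: np.ndarray, z_d: np.ndarray, episode_length: int) -> np.ndarray:
    """Per-step rewards for one episode: zero everywhere except the last step.

    Assumption: the episode-level similarity is handed to the agent once,
    at the end of the episode, because the embedding covers the whole episode.
    """
    if episode_length < 1:
        raise ValueError("episode_length must be at least 1")
    rewards = np.zeros(episode_length)
    rewards[-1] = similarity_reward(z_v, z_d)
    return rewards
```

In use, `z_v` would come from encoding the video of the agent's episode and `z_d` from encoding the text or video task specifier; the resulting reward sequence can then be passed to any online RL algorithm, such as PPO.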

Paper Structure (21 sections, 4 equations, 9 figures)

This paper contains 21 sections, 4 equations, 9 figures.

Introduction
Related Work
Learning from Human Feedback.
Large Vision and Language Models as Reward Functions.
Method
Overview.
Notation.
Reward Generation.
Agent Training.
Experiments
Baselines.
Domain Alignment
Language for Reward Generation
In-Domain Videos for Reward Generation
Quantitative Results.
...and 6 more sections

Figures (9)

Figure 1: RoboCLIP Overview. A pretrained Video-and-Language Model is used to generate rewards via the similarity score between the encoding of an episode of interaction of an agent in its environment, $\vb{z}^v$, and the encoding of a task specifier, $\vb{z}^d$, such as a textual description of the task or a video demonstrating a successful trajectory. The similarity score between the latent vectors is provided as reward to the agent.
Figure 2: Domain Alignment Confusion Matrix. We perform a confusion matrix analysis on a subset of the data collected on Metaworld (Yu et al., 2020) environments by comparing the pairwise similarities between the latent vectors of the strings describing the videos and those of the videos. We find that Metaworld is well-aligned, with higher scores along the diagonal than in the off-diagonal elements.
Figure 3: Language-Conditioned Reward Generation. The pretrained VLM is used to generate rewards via the similarity score between the encoding of an episode of interaction of an agent in its environment, $\vb{z}^v$, and the encoding of a task specifier, $\vb{z}^d$, specified in natural language. We use the strings "robot closing black box", "robot closing green drawer", and "robot pushing red button" as conditioning for the three environments, respectively. We find that agents pretrained on these language-conditioned rewards outperform imitation learning baselines such as GAIL (Ho and Ermon, 2016) and AIRL (Fu et al., 2017).
Figure 4: Using In-Domain Videos for Reward Generation. The pretrained VLM is used to generate rewards via the similarity score between the encoding of an episode of interaction of an agent in its environment, $\vb{z}^v$, and the encoding of a video demonstration of expert behavior in the same environment. The similarity score between the latent vectors is provided as reward to the agent and is used to train online RL methods. We study this setup in the Kettle, Hinge, and Slide tasks in the Franka Kitchen environment (Gupta et al., 2019). We find that policies trained on the RoboCLIP reward learn to complete the task in all three setups using just a single in-domain demonstration, without any need for external rewards.
Figure 5: Qualitative Inspection of Imitation. The first row in each subfigure shows frames from the demonstration video used for reward generation via the VLM. The second row shows frames from the policy recovered by training on the RoboCLIP reward generated from the video in the first row. The quick swiping motion in the Slide demonstration is mimicked well by the resulting policy, and the wrist-rotational "trick-shot" behavior in the Hinge demonstration also appears in the learned policy.
...and 4 more figures
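
The domain alignment check described in the Figure 2 caption can be illustrated with a short sketch. It is an assumption-laden example rather than the paper's analysis code: the embeddings are placeholder NumPy arrays with one non-zero row per task, the function names `similarity_matrix` and `diagonal_margin` are hypothetical, and cosine similarity plus the mean diagonal-minus-off-diagonal margin are choices made here for illustration.

```python
import numpy as np


def similarity_matrix(text_embs: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities; entry (i, j) compares text i with video j.

    Assumption: every row is a non-zero embedding vector.
    """
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return t @ v.T


def diagonal_margin(sim: np.ndarray) -> float:
    """Mean of the diagonal minus mean of the off-diagonal entries.

    A positive margin means matching text/video pairs score higher on
    average than mismatched pairs, i.e. the domain looks well aligned.
    """
    n = sim.shape[0]
    if sim.shape != (n, n) or n < 2:
        raise ValueError("expected a square matrix with at least 2 rows")
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(np.mean(np.diag(sim)) - np.mean(off_diagonal))
```

Row `i` of each array is assumed to hold the embedding of the description and the video for the same task, so a well-aligned domain shows up as a clearly positive margin.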
