RoboCLIP: One Demonstration is Enough to Learn Robot Policies
Sumedh A Sontakke, Jesse Zhang, Sébastien M. R. Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, Laurent Itti
TL;DR
RoboCLIP is an online imitation learning method that generates rewards from a single demonstration, given either as a video or as a language description of the task, using a pretrained video-and-language model (VLM). This removes the need for large expert datasets and manual reward engineering. RoboCLIP embeds both the agent’s episode and the task specification into the VLM's shared latent space and uses their similarity as the reward, which online RL (PPO) then optimizes to learn policies on Metaworld and Franka Kitchen. Agents trained with these rewards achieve 2-3 times higher zero-shot performance than competing imitation learning methods. The approach supports language-conditioned rewards, in-domain and out-of-domain video demonstrations, and multimodal task specifications. With optional finetuning on true task rewards, a policy learned from a single demonstration can reach competitive or superior results. The findings highlight the practical potential of large VLMs for reward generation: task specification becomes more flexible and the annotation burden drops, although potential biases and stability challenges remain for real-world deployments.
Abstract
Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement), in the form of a video or a textual description of the task, to generate rewards without manual reward function design. RoboCLIP can also utilize out-of-domain demonstrations, such as videos of humans solving the task, for reward generation, circumventing the need for the demonstration and deployment domains to match. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, using only one video or text demonstration.
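
The reward idea summarized above (embed the agent's episode and the task specification in a shared latent space, then use their similarity as the reward) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the two embeddings have already been produced by some pretrained video-and-language encoder, that the similarity is cosine similarity, and that the reward is issued once at the final step of the episode. The function names `similarity_reward` and `episode_rewards` are hypothetical.

```python
import numpy as np


def similarity_reward(episode_embedding, task_embedding):
    """Cosine similarity between two embedding vectors.

    Assumption: both vectors come from the same pretrained encoder, so they
    live in one shared latent space and have the same dimensionality.
    """
    z_episode = np.asarray(episode_embedding, dtype=np.float64).ravel()
    z_task = np.asarray(task_embedding, dtype=np.float64).ravel()
    if z_episode.shape != z_task.shape:
        raise ValueError("embeddings must have the same dimensionality")
    norm_product = np.linalg.norm(z_episode) * np.linalg.norm(z_task)
    if norm_product == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return float(z_episode @ z_task / norm_product)


def episode_rewards(num_steps, episode_embedding, task_embedding):
    """Per-step rewards for one episode.

    Assumption: the reward is sparse, zero at every step except the last,
    where the episode-level similarity is assigned.
    """
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    rewards = np.zeros(num_steps, dtype=np.float64)
    rewards[-1] = similarity_reward(episode_embedding, task_embedding)
    return rewards
```

In an online RL loop, the array returned by `episode_rewards` would stand in for the environment's reward signal when updating the policy (for example with PPO). The encoder itself stays frozen, in line with the abstract's statement that the pretrained VLM is used without finetuning.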
