Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation

Suraj Nair, Eric Mitchell, Kevin Chen, Brian Ichter, Silvio Savarese, Chelsea Finn

TL;DR

This work addresses grounding natural language instructions for vision-based robotic manipulation using offline, potentially sub-optimal data annotated via crowd-sourcing. It introduces LOReL (Language-conditioned Offline Reward Learning), a framework that grounds language by learning a binary-classifier reward function $\mathcal{R}_\theta: \mathcal{S}\times\mathcal{S}\times\mathcal{L}\to[0,1]$ from trajectories, where $\mathcal{R}_\theta(s_0,s_t,l)$ scores whether the change from the initial state $s_0$ to the state $s_t$ completes the instruction $l$. Tasks are then executed with a learned visual dynamics model, using model-predictive control to maximize the expected reward $\mathbb{E}[\sum_t \mathcal{R}_\theta(s_0,s_t,l)]$. Key contributions include a scalable pipeline for offline language grounding, an empirical demonstration that LOReL outperforms goal-image and language-imitation baselines by over $25\%$, and the ability to generalize to unseen commands, including zero-shot evaluation, via a pre-trained language model. In real-robot experiments with crowd-sourced annotations, LOReL solves five language-conditioned visuomotor skills on a Franka Emika Panda with $66\%$ average success, and ablations confirm the role of negative examples and data cleaning. Overall, the approach enables scalable, language-grounded robot control from diverse offline data.
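
The planning step summarized above can be illustrated with a small sketch. This is not the authors' implementation: the one-dimensional `toy_dynamics` and the rule-based `toy_reward` are hypothetical stand-ins for the learned visual dynamics model and the learned reward, and the cross-entropy-method-style refitting loop is an assumption about how the "multiple iterations" of ranking might be organized. It only shows the idea of scoring sampled action sequences by the summed reward $\sum_t \mathcal{R}_\theta(s_0,s_t,l)$.

```python
import random


def toy_dynamics(state, action):
    # Hypothetical stand-in for the learned visual dynamics model:
    # the "state" is a single drawer position shifted by the action.
    return state + action


def toy_reward(s0, st, instruction):
    # Hypothetical stand-in for R_theta(s_0, s_t, l), returning a value in [0, 1].
    if instruction == "open the drawer":
        return 1.0 if st - s0 >= 1.0 else 0.0
    return 0.0


def plan(s0, instruction, horizon=5, samples=64, elites=8, iters=3, seed=0):
    """Sample action sequences, rank them by summed reward, refit, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [0.5] * horizon
    best_seq, best_score = None, float("-inf")
    for _ in range(iters):
        scored = []
        for _ in range(samples):
            seq = [rng.gauss(m, s) for m, s in zip(mean, std)]
            state, score = s0, 0.0
            for action in seq:
                # Roll the model forward and accumulate the reward at every step.
                state = toy_dynamics(state, action)
                score += toy_reward(s0, state, instruction)
            scored.append((score, seq))
        # Rank sequences by predicted return, highest first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_seq = scored[0]
        # Refit the sampling distribution to the top-ranked sequences.
        top = [seq for _, seq in scored[:elites]]
        for t in range(horizon):
            values = [seq[t] for seq in top]
            mean[t] = sum(values) / len(values)
            variance = sum((v - mean[t]) ** 2 for v in values) / len(values)
            std[t] = max(variance ** 0.5, 1e-3)
    return best_seq, best_score


best_seq, best_score = plan(0.0, "open the drawer")
```

In this sketch, `best_seq` is the action sequence that would be stepped in the environment. Because the reward compares every predicted state against the initial state, a sequence that completes the instruction early and stays there accumulates more reward than one that completes it only at the final step.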

Abstract

We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction. In order to accomplish this, humans need easy and effective ways of specifying tasks to the robot. Goal images are one popular form of task specification, as they are already grounded in the robot's observation space. However, goal images also have a number of drawbacks: they are inconvenient for humans to provide, they can over-specify the desired behavior, leading to a sparse reward signal, or under-specify task information in the case of non-goal-reaching tasks. Natural language provides a convenient and flexible alternative for task specification, but comes with the challenge of grounding language in the robot's observation space. To scalably learn this grounding, we propose to leverage offline robot datasets (including highly sub-optimal, autonomously collected data) with crowd-sourced natural language labels. With this data, we learn a simple classifier which predicts if a change in state completes a language instruction. This provides a language-conditioned reward function that can then be used for offline multi-task RL. In our experiments, we find that on language-conditioned manipulation tasks our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%, and is able to perform visuomotor tasks from natural language, such as "open the right drawer" and "move the stapler", on a Franka Emika Panda robot.

Paper Structure

This paper contains 36 sections, 1 equation, 18 figures, 3 tables.

Figures (18)

  • Figure 1: We learn language-conditioned visuomotor policies using sub-optimal offline data, crowd-sourced annotation, and pre-trained language models, enabling a real robot to complete language-specified tasks while being robust to complex rephrasings of the task description.
  • Figure 2: Language-conditioned Offline Reward Learning (LOReL). We propose a technique to learn language-conditioned behavior from offline datasets of robot interaction (left). To do so, we crowd-source natural language annotations describing the behavior in the offline data, and use them to learn a language-conditioned reward function (middle). We then combine this reward and a learned visual dynamics model through model predictive control to complete language-specified tasks from vision on a real robot (right).
  • Figure 3: Training LOReL. We train LOReL on balanced batches of positive examples where the initial/final image transition satisfies the language command (left), negative examples where the initial/final states satisfy a different instruction (middle), and negative examples where the initial and final images are reversed (right).
  • Figure 4: Executing Language-Conditioned Policies with LOReL. To execute language-conditioned behavior, we perform model predictive control with a learned visual dynamics model and LOReL. Specifically, from the initial state we predict many future states for different action sequences (left/middle). We then rank those sequences according to the LOReL reward for the user-specified natural language instruction (middle). After multiple iterations, the best action sequence is stepped in the environment, executing the task (right).
  • Figure 5: Simulated Domain/Data. We leverage a simulated domain developed on top of Meta-World (Yu et al., 2020), which contains a Sawyer robot interacting with a drawer, faucet, and two mugs (left). We collect data using a random policy, and annotate episodes with language instructions using the environment state (right).
  • ...and 13 more figures
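
The three example types named in the Figure 3 caption can be made concrete with a short sketch. It is illustrative only: the function name `make_batch`, the tuple layout, and the choice to draw equal thirds of each example type are assumptions (the caption says only that batches are balanced), and string placeholders stand in for images.

```python
import random


def make_batch(dataset, n_per_type, seed=0):
    """Build (initial, final, instruction, label) examples for reward training.

    dataset: list of (initial_image, final_image, instruction) annotated episodes.
    Assumption: each of the three example types contributes n_per_type examples.
    """
    if len({instruction for _, _, instruction in dataset}) < 2:
        raise ValueError("need at least two distinct instructions for negatives")
    rng = random.Random(seed)
    batch = []
    for _ in range(n_per_type):
        s0, s_final, instruction = rng.choice(dataset)
        # Positive: the transition satisfies its own annotation.
        batch.append((s0, s_final, instruction, 1))
        # Negative: a transition annotated with a different instruction,
        # paired with this instruction.
        others = [ep for ep in dataset if ep[2] != instruction]
        o0, o_final, _ = rng.choice(others)
        batch.append((o0, o_final, instruction, 0))
        # Negative: the same transition with initial and final images reversed.
        batch.append((s_final, s0, instruction, 0))
    # Examples are left grouped by type here; shuffle before training if desired.
    return batch


episodes = [
    ("drawer_closed.png", "drawer_open.png", "open the drawer"),
    ("faucet_left.png", "faucet_right.png", "turn the faucet right"),
    ("mug_near.png", "mug_far.png", "push the mug away"),
]
batch = make_batch(episodes, n_per_type=4)
```

The reversed negatives are what force the classifier to attend to the direction of the change: an opened drawer and a closed drawer contain the same objects, so only the order of the two images distinguishes "open" from "close".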