On-Robot Reinforcement Learning with Goal-Contrastive Rewards

Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, Lawson L. S. Wong

TL;DR

The paper tackles the challenge of sample-inefficient on-robot reinforcement learning by learning dense rewards from passive videos. It introduces Goal-Contrastive Rewards (GCR), which combines implicit state-value learning with goal-contrastive losses to produce a discriminative, adaptive reward signal that guides online RL, including cross-embodiment transfer from human and other-robot videos. GCR integrates with an asynchronous SERL-based framework, enabling parallel reward prediction and RL training and combining intrinsic (learned rewards) and extrinsic (foundation-model) signals. Across six simulated tasks and real-world experiments with a Franka arm and a Spot, GCR significantly improves sample efficiency, solving more tasks with fewer demonstrations and demonstrating positive cross-embodiment transfer, thus offering a scalable route to broad on-robot RL deployment.

Abstract

Reinforcement Learning (RL) has the potential to enable robots to learn from their own actions in the real world. Unfortunately, RL can be prohibitively expensive in terms of on-robot runtime, due to inefficient exploration when learning from a sparse reward signal. Designing dense reward functions is labour-intensive and requires domain expertise. In our work, we propose GCR (Goal-Contrastive Rewards), a dense reward function learning method that can be trained on passive video demonstrations. Because it uses videos without actions, our method is easier to scale, as we can use arbitrary videos. GCR combines two loss functions: an implicit value loss that models how the reward increases when traversing a successful trajectory, and a goal-contrastive loss that discriminates between successful and failed trajectories. We perform experiments in simulated manipulation environments across RoboMimic and MimicGen tasks, as well as in the real world using a Franka arm and a Spot quadruped. We find that GCR leads to more sample-efficient RL, enabling model-free RL to solve about twice as many tasks as our baseline reward learning methods. We also demonstrate positive cross-embodiment transfer from videos of people and of other robots performing a task. Website: https://gcr-robot.github.io/.
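To make the two-term structure described in the abstract concrete, here is a minimal Python sketch. It is not the paper's objective: a simple margin-based ranking loss stands in for the implicit value loss, and a softmax cross-entropy stands in for the goal-contrastive loss. The function names, the `margin` and `weight` parameters, and the use of scalar state values are all assumptions made for illustration.

```python
import math


def temporal_ranking_loss(values, margin=0.1):
    """Stand-in for the temporal term (assumption, not the paper's loss).

    Penalizes consecutive frames of a successful trajectory whose predicted
    value fails to increase by at least `margin`.
    """
    pairs = list(zip(values[:-1], values[1:]))
    hinge = [max(0.0, margin - (v_next - v_prev)) for v_prev, v_next in pairs]
    return sum(hinge) / len(pairs)


def goal_contrastive_loss(success_value, failure_values):
    """Stand-in for the contrastive term (assumption, not the paper's loss).

    Softmax cross-entropy that treats the final state of a successful
    trajectory as the positive and final states of failures as negatives.
    """
    logits = [success_value] + list(failure_values)
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - success_value


def combined_loss(success_values, failure_final_values, weight=1.0):
    """Weighted sum of the two stand-in terms; `weight` is hypothetical."""
    temporal = temporal_ranking_loss(success_values)
    contrastive = goal_contrastive_loss(success_values[-1], failure_final_values)
    return temporal + weight * contrastive
```

In this sketch, a value function that rises along the successful trajectory and scores the failed trajectory's final state low receives a smaller combined loss than one that does the opposite.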

Paper Structure

This paper contains 19 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: A model trained with only a temporal loss $\mathcal{L}_T$ assigns a high state value (left) to the pair of state-goal images (right) due to the arm being in the same pose, whereas a model trained with a combination of $\mathcal{L}_T$ and a contrastive loss $\mathcal{L}_N$ (ours) learns to distinguish the position of the block, assigning a low state value.
  • Figure 2: On-robot reinforcement learning system overview.
  • Figure 3: We show three example states and their associated GCR state values (scaled to the range zero to one) at 0 and 5000 online GCR training steps. The 0-step version of GCR is fine-tuned on demonstrations, but not on any online data. 5000 steps corresponds to approximately one hour of on-robot training.
  • Figure 4: Four different drawer opening policies learned by GCR. First: handle grasp, second: top grasp, third: top finger through handle, fourth: bottom grasp. We find that GCR improves exploration but does not prescribe a specific way of opening the drawer, whereas RLPD always converges to the same policy (handle grasp).
  • Figure 5: (Left) Cumulative returns of DrQ trained with different reward functions (x axis). We normalize the cumulative returns by the highest achieved value across all methods and runs. We report four random seeds across three simulated tasks, SERL Pick cube (20 passive demos), RoboMimic Lift (20) and Can (20), and MimicGen Stack D0 (100). (Right) We also add results for real-world Franka Lift, Kettle and Spot drawer opening tasks (Figure \ref{fig:rw_tasks}). We run these tasks with only GCR and Sparse rewards.
  • ...and 5 more figures
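
The normalization mentioned in the Figure 5 caption (dividing cumulative returns by the highest value achieved across all methods and runs) can be sketched as follows. The function name, the dictionary layout, and the numbers in the usage example are hypothetical and are not results from the paper.

```python
def normalize_returns(returns_by_method):
    """Divide every run's cumulative return by the global maximum.

    `returns_by_method` maps a method name to a list of cumulative returns,
    one per run (layout assumed for illustration).
    """
    best = max(r for runs in returns_by_method.values() for r in runs)
    if best <= 0:
        raise ValueError("normalization requires a positive maximum return")
    return {
        method: [r / best for r in runs]
        for method, runs in returns_by_method.items()
    }


# Illustrative numbers only, not taken from the paper.
example = {"method_a": [10.0, 20.0], "method_b": [5.0, 40.0]}
normalized = normalize_returns(example)
```

With this scheme the best single run across all methods maps to 1.0, so bars for different reward functions are directly comparable within a task.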