PROGRESSOR: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement

Tewodros Ayalew, Xiao Zhang, Kevin Yuanbo Wu, Tianchong Jiang, Michael Maire, Matthew R. Walter

TL;DR

Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data for generalization to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual reward for robotic learning.

Abstract

We present PROGRESSOR, a novel framework that learns a task-agnostic reward function from videos, enabling policy training through goal-conditioned reinforcement learning (RL) without manual supervision. Underlying this reward is an estimate of the distribution over task progress as a function of the current, initial, and goal observations that is learned in a self-supervised fashion. Crucially, PROGRESSOR refines rewards adversarially during online RL training by pushing back predictions for out-of-distribution observations, to mitigate distribution shift inherent in non-expert observations. Utilizing this progress prediction as a dense reward together with an adversarial push-back, we show that PROGRESSOR enables robots to learn complex behaviors without any external supervision. Pretrained on large-scale egocentric human video from EPIC-KITCHENS, PROGRESSOR requires no fine-tuning on in-domain task-specific data for generalization to real-robot offline RL under noisy demonstrations, outperforming contemporary methods that provide dense visual reward for robotic learning. Our findings highlight the potential of PROGRESSOR for scalable robotic applications where direct action labels and task-specific rewards are not readily available.
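The progress-based reward described in the abstract can be illustrated with a small sketch. The abstract states only that the model estimates a distribution over task progress from the current, initial, and goal observations, and that predictions for out-of-distribution observations are pushed back. Everything else here is an assumption made for illustration: the frame-index definition of normalized progress, the Gaussian negative log-likelihood objective, the entropy-penalized reward, and the names `alpha` and `beta` are hypothetical and are not taken from the authors' implementation.

```python
import math


def normalized_progress(i: int, j: int, g: int) -> float:
    """Assumed self-supervised label: how far frame j lies between
    the initial frame i and the goal frame g, as a value in [0, 1]."""
    if not (i <= j <= g) or i == g:
        raise ValueError("expected i <= j <= g with i < g")
    return (j - i) / (g - i)


def gaussian_nll(target: float, mean: float, std: float) -> float:
    """Negative log-likelihood of `target` under N(mean, std^2).
    A plausible training loss for a model that predicts (mean, std)."""
    return 0.5 * math.log(2.0 * math.pi * std**2) + (target - mean) ** 2 / (2.0 * std**2)


def gaussian_entropy(std: float) -> float:
    """Differential entropy of a univariate Gaussian with std `std`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std**2)


def progress_reward(mean: float, std: float, alpha: float = 0.4) -> float:
    """Hypothetical dense reward: predicted progress, discounted when the
    prediction is uncertain. `alpha` is an illustrative weight."""
    return mean - alpha * gaussian_entropy(std)


def push_back_target(predicted_mean: float, beta: float = 0.9) -> float:
    """Hypothetical push-back label for a non-expert observation: shrink the
    model's own progress prediction by a factor beta in [0, 1]."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * predicted_mean


if __name__ == "__main__":
    label = normalized_progress(i=0, j=5, g=10)
    print("progress label:", label)
    print("loss at a good prediction:", gaussian_nll(label, mean=0.5, std=0.1))
    print("reward (confident):", progress_reward(mean=0.5, std=0.05))
    print("reward (uncertain):", progress_reward(mean=0.5, std=0.5))
    print("push-back target:", push_back_target(0.8, beta=0.5))
```

Under these assumptions, expert frames are trained toward their true progress label, while online rollout frames are trained toward a shrunken version of the current prediction, so the reward model gradually assigns lower progress to states that do not resemble expert behavior.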

Paper Structure

This paper contains 26 sections, 8 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Trained in a self-supervised manner on expert videos, PROGRESSOR predicts an agent's progress toward task completion, providing a reward signal for reinforcement learning. During online reinforcement learning, we employ an adversarial technique to refine this reward estimate, addressing the distribution shift between expert data and non-expert online rollouts.
  • Figure 2: Top Left: Initial phase of reward model pretraining on expert data, where the model learns to predict the parameters of a Gaussian distribution centered on normalized progress, reflecting expected progress as demonstrated by experts. Top Right: In online reinforcement learning (RL) training, an adversarial online refinement (i.e., push-back) is applied to counteract non-expert predictions made by the reward model, effectively distinguishing expert from non-expert progress. Bottom: During online RL, the reward model is updated on expert and non-expert data.
  • Figure 3: Visualization of the robotic tasks: (a-d) Real-world environments with a UR5 arm. (e-j) Simulation environments for evaluation using the Meta-World benchmark (Yu et al., 2020).
  • Figure 4: Visualization of policy learning in the Meta-World (Yu et al., 2020) simulation environment. We run PROGRESSOR and several baselines on six diverse tasks of various difficulties. We also run PROGRESSOR without online push-back as an ablation. We report the environment reward during training (left) and the task success rate from 10 rollouts (right), averaged over five seeds. The solid line denotes the mean and the transparent area denotes the standard deviation. PROGRESSOR demonstrates clear advantages in both metrics, especially at early stages of training.
  • Figure 5: Success rates for four real-world tasks, where RWR-ACT is trained on a combination of correct and failed demonstrations using PROGRESSOR, R3M, and VIP as reward models.
  • ...and 8 more figures