TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

TL;DR

TimeRewarder tackles the challenge of sparse, hard-to-design rewards in robotic RL by learning dense progress signals from action-free videos. It frames task progress as frame-wise temporal distance and trains a progress model $F_\theta$ to predict normalized distances between frames, discretized via a two-hot scheme and learned with cross-entropy. During RL, the model furnishes dense step-wise rewards $r_{TR} \in [-1, 1]$, augmented with a sparse environment success signal to guide policy optimization, and is shown to outperform baselines and even manually designed dense rewards on 9 of 10 Meta-World tasks, with strong sample efficiency. The approach also demonstrates cross-domain transfer by leveraging real-world human videos, highlighting a scalable path to rich reward signals from diverse video sources for imitation-from-observation settings.
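
The two-hot discretization and cross-entropy objective mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the number of bins (21), the uniform bin spacing, and the normalization of frame distance by (T - 1) are all assumptions made for illustration.

```python
import numpy as np

def normalized_distance(i, j, length):
    """Signed temporal distance between frames i and j, scaled to [-1, 1].

    Assumption: distances are normalized by (length - 1).
    """
    return (j - i) / (length - 1)

def two_hot(value, num_bins=21, low=-1.0, high=1.0):
    """Spread a scalar over its two nearest bin centers.

    The weights are chosen so that the expectation over bin centers
    recovers the (clipped) input value.
    """
    centers = np.linspace(low, high, num_bins)
    value = float(np.clip(value, low, high))
    pos = (value - low) / (high - low) * (num_bins - 1)  # fractional bin index
    lo = int(np.floor(pos))
    hi = min(lo + 1, num_bins - 1)
    frac = pos - lo
    target = np.zeros(num_bins)
    target[lo] += 1.0 - frac
    target[hi] += frac
    return target, centers

def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-(target * log_probs).sum())

# Example: frames 2 and 7 of an 11-frame video, with hypothetical model logits.
d = normalized_distance(2, 7, 11)
target, centers = two_hot(d)
loss = soft_cross_entropy(np.zeros_like(target), target)
```

Decoding a prediction as the expectation of the bin centers under the softmax distribution would return a continuous value in [-1, 1], which is one plausible way to turn the classifier output back into a progress estimate.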

Abstract

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperforms previous methods and even the manually designed environment dense reward in both final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable path to rich reward signals from diverse video sources.
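
As a rough illustration of how a step-wise proxy reward might be combined with the sparse success signal during RL, the sketch below adds a bonus on success to a dense reward in [-1, 1]. The function name `shaped_reward`, the additive form, and the `success_bonus` weight are hypothetical choices; the source states only that the dense reward is augmented with a sparse environment success signal.

```python
def shaped_reward(r_tr, success, success_bonus=1.0):
    """Combine a dense proxy reward with a sparse success signal.

    r_tr:          dense step-wise reward from the progress model, in [-1, 1]
    success:       True if the environment reports task success at this step
    success_bonus: hypothetical weight on the sparse signal (an assumption)
    """
    if not -1.0 <= r_tr <= 1.0:
        raise ValueError("r_tr is expected to lie in [-1, 1]")
    return r_tr + success_bonus * float(success)

# A step that makes progress but has not yet succeeded,
# followed by a step on which the task is completed.
rewards = [shaped_reward(0.25, False), shaped_reward(0.5, True)]
```

Because the dense term is bounded, the sparse bonus stays on a comparable scale, so neither signal dominates the other by construction.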

Paper Structure

This paper contains 24 sections, 9 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Overview of TimeRewarder. Mirroring how humans infer task progression by observing others, TimeRewarder distills frame-wise temporal distances from expert videos and converts them into dense reward signals, thereby enabling reinforcement learning free of manually engineered rewards or action annotations.
  • Figure 2: TimeRewarder framework. TimeRewarder learns step-wise dense rewards from passive videos by modeling intrinsic temporal distances, enabling robust progress scoring that assigns high values to states reflecting task advancement, while penalizing suboptimal actions lacking meaningful contribution to task progression, thereby facilitating effective policy learning.
  • Figure 3: Value–Order Correlation (VOC) on held-out expert videos. Higher is better.
  • Figure 4: Reward/value curves on successful (traj1) vs. failed (traj2) rollouts for two tasks. TimeRewarder and VIP output values (cumulative progress), PROGRESSOR outputs stepwise rewards, while Rank2Reward is visualized through its pairwise ordering reward signals. TimeRewarder provides the most informative and temporally coherent feedback.
  • Figure 5: Performance of reinforcement learning with sparse environment success signals and dense proxy rewards from each method. Curves show mean $\pm$ s.d. over eight seeds. Dashed lines indicate reference settings of behavior cloning (BC) and environment dense reward supervision.
  • ...and 7 more figures
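
The Value-Order Correlation reported in Figure 3 can be sketched as follows, assuming it is the Spearman rank correlation between the predicted per-frame values and the chronological frame order of a video. That definition is an assumption (the summary does not spell it out), and the helper names are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Rank the entries of x from 0, giving tied entries their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x), dtype=float)
    for v in np.unique(x):
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def voc(values):
    """Spearman correlation between predicted values and frame order.

    Returns 1.0 when values increase strictly over time, -1.0 when they
    decrease strictly, and 0.0 when all values are identical.
    """
    value_ranks = average_ranks(values)
    frame_order = np.arange(len(value_ranks), dtype=float)
    if value_ranks.std() == 0.0:
        return 0.0  # constant predictions carry no ordering information
    return float(np.corrcoef(value_ranks, frame_order)[0, 1])

# Hypothetical per-frame progress values on a held-out expert video.
score = voc([0.05, 0.20, 0.18, 0.60, 0.95])
```

Under this definition a score near 1 means the predicted values rise consistently with time on expert videos, which matches the "higher is better" reading of the figure.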