Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang

TL;DR

Robometer is a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. It learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications.

Abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
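
The dual objective described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss forms, so the mean-squared-error progress term, the linear 0-to-1 progress targets, the logistic preference term, and the `preference_weight` parameter are all hypothetical choices made for this example.

```python
import math
from typing import Optional, Sequence


def linear_progress_targets(num_frames: int) -> list[float]:
    """Assumed labels: progress rises linearly from 0 to 1 over an expert demo."""
    if num_frames < 2:
        return [1.0] * num_frames
    return [t / (num_frames - 1) for t in range(num_frames)]


def progress_loss(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Frame-level term (assumed MSE) between predicted and target progress."""
    if len(predicted) != len(targets) or not predicted:
        raise ValueError("predicted and targets must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, targets)) / len(predicted)


def preference_loss(logit_first_better: float, first_is_better: bool) -> float:
    """Trajectory-level term (assumed logistic loss) on a 'first is better' logit."""
    signed = logit_first_better if first_is_better else -logit_first_better
    # Numerically stable softplus(-signed), which equals -log(sigmoid(signed)).
    return max(-signed, 0.0) + math.log1p(math.exp(-abs(signed)))


def dual_objective(
    predicted_progress: Sequence[float],
    progress_targets: Optional[Sequence[float]],
    logit_first_better: float,
    first_is_better: bool,
    preference_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Sum of the two terms; the progress term is skipped for non-expert data."""
    total = preference_weight * preference_loss(logit_first_better, first_is_better)
    if progress_targets is not None:
        total += progress_loss(predicted_progress, progress_targets)
    return total
```

In this sketch, passing `progress_targets=None` for a failed or suboptimal trajectory leaves only the preference term, which mirrors the abstract's point that progress labels anchor reward magnitude on expert data while comparisons supply the signal for everything else.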

Paper Structure (86 sections, 5 equations, 16 figures, 26 tables)

Figures (16)

  • Figure 2: Robometer is a VLM-based reward model that predicts dense, per-frame progress-based rewards and success labels for the first of two video trajectories. To train with failed, non-expert data, we also predict which of the two video trajectories better completes the task. We use three strategies for curating training examples from our given datasets, which are further detailed in Section \ref{sec:augmentation}, with the model architecture shown in Appendix \ref{fig:architecture}.
  • Figure 3: Video-Language Reward Confusion Matrix. For each task sampled at random from self-collected, unseen data from RBM-EVAL-OOD, we compute rewards for all combinations of demonstration videos and language descriptions. Robometer produces the most diagonal-heavy confusion matrix, indicating strong alignment between unseen demos and instructions. We also report the column-normalized diagonal mean under each model, which represents the fraction of the model’s total reward for aligned task and video pairs.
  • Figure 4: Qualitative Analysis of Failure, Suboptimal and Successful Trajectories. We visualize the progress predictions for three trajectories of different quality for the same task. Notably, for the suboptimal trajectory, Robometer predicts steadily increasing progress as the robot approaches the pen holder, but sharply reduces its progress estimate when the marker is dropped, correctly reflecting regression in task completion. In contrast, RoboReward continues to assign high progress despite the task failure. Finally, Robometer is the only model that correctly predicts task success for the successful trajectory (i.e., high final progress value and explicit success prediction).
  • Figure 5: RL w/ Ablation Models in LIBERO-90 tasks from scratch, corresponding to ablations trained only on LIBERO-10/Object/Goal/Spatial data from Table \ref{tab:exp:ablations}. We report the average success rate $\pm$ standard deviation across 5 seeds.
  • Figure 6: Automatic online RL with DSRL on a DROID setup with Robometer improves $\pi_0$ from 20% to 85% on a single-stage task and 20% to 70% on a two-stage task, outperforming RoboReward's overall success rate by $2.5\times$. DSRL with Robometer learns to avoid base $\pi_0$ errors such as collisions or moving the wrong object. The setup is deemed "automatic" because success detection and stage advancement are handled automatically by the reward model, requiring human intervention only for physical scene resets.
  • ...and 11 more figures
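
The column-normalized diagonal mean mentioned in the Figure 3 caption can be illustrated with a short sketch. The interpretation here is an assumption, since the caption does not give a formula: rows are taken to index demonstration videos, columns to index language descriptions, rewards are assumed non-negative, and the function name is hypothetical.

```python
from typing import Sequence


def column_normalized_diagonal_mean(rewards: Sequence[Sequence[float]]) -> float:
    """Average share of each column's total reward that falls on the diagonal.

    Assumed layout: rewards[i][j] is the reward for video i under instruction j,
    so the diagonal holds the aligned video/instruction pairs.
    """
    n = len(rewards)
    if n == 0 or any(len(row) != n for row in rewards):
        raise ValueError("rewards must be a non-empty square matrix")

    diagonal_shares = []
    for j in range(n):
        column_total = sum(rewards[i][j] for i in range(n))
        if column_total <= 0:
            raise ValueError("each column needs a positive total reward")
        diagonal_shares.append(rewards[j][j] / column_total)
    return sum(diagonal_shares) / n
```

Under this reading, a model that rewards every video equally regardless of instruction scores 1/n for n tasks, and a model that rewards only aligned pairs scores 1, so higher values indicate a more diagonal-heavy confusion matrix.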