TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Shirui Chen; Cole Harrison; Ying-Chun Lee; Angela Jin Yang; Zhongzheng Ren; Lillian J. Ratliff; Jiafei Duan; Dieter Fox; Ranjay Krishna

TL;DR

TOPReward is a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. It also serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.

Abstract

While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
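
The central idea of the abstract, reading a progress signal from token logits instead of parsing a generated number, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-token vocabulary, the choice of a "True" completion token, and the toy logits for three trajectory prefixes are hypothetical and are not taken from the paper, whose exact prompt and token choice are not given in this section.

```python
import math


def token_log_prob(logits, token_id):
    """Log-softmax of one token, using the max-shift trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - log_z


# Hypothetical toy vocabulary (assumption, not the paper's).
VOCAB = {"True": 0, "False": 1, "maybe": 2}

# Hypothetical next-token logits a VLM might emit after seeing three
# increasingly long prefixes of one trajectory (assumption, made-up numbers).
prefix_logits = [
    [0.0, 2.0, 0.5],  # short prefix: "False" dominates
    [1.0, 1.0, 0.5],  # mid prefix: undecided
    [3.0, 0.0, 0.5],  # long prefix: "True" dominates
]

# Progress estimate per prefix = probability mass on the completion token.
progress = [
    math.exp(token_log_prob(logits, VOCAB["True"])) for logits in prefix_logits
]
```

Because the score is a probability rather than generated text, it is continuous and bounded in (0, 1), which avoids the numerical misrepresentation the abstract attributes to prompting a VLM for progress values directly.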

Paper Structure (23 sections, 6 equations, 17 figures, 7 tables)

Introduction
Related Work
TOPReward
Token probability as the reward
Chat templates
Progress estimation from trajectory prefixes
Mani Reward Bench: a benchmark for reward modeling in robotic manipulation
Experiments
Large-scale real-world evaluation
Success detection
Real-world advantage-weighted behavior cloning
Ablation
Conclusion
Impact Statements
Alternative Reward Formulation
...and 8 more sections

Figures (17)

Figure 1: Result highlights. $\texttt{TOPReward}$ enables effective zero-shot estimation of task progress across diverse and challenging real-world manipulation tasks, without task-specific training. By bootstrapping on a range of vision–language model backbones, $\texttt{TOPReward}$ provides a temporally consistent visual reward signal that supports multiple downstream applications, including success detection, policy improvement, and evaluation on our in-house benchmark, $\texttt{Mani Reward Bench}$.
Figure 2: Qualitative example of "Fold the Towel": Instruction-conditioned progress estimation on a real trajectory. The curve shows TOPReward's predicted completion value over time, with annotated values at selected frames corresponding to semantic subtasks.
Figure 3: VOC comparison across datasets. Mean dataset-level VOC for GVL (0-shot) and TOPReward across two evaluation sets: OXE (39 datasets, 20 episodes each) and Mani Reward Bench (4 datasets, 113 tasks, 497 episodes). Error bars denote standard deviation across datasets within each evaluation set.
Figure 4: Progress traces for ManiRewardBench. Example progress traces predicted by TOPReward (orange) compared to stage-aware ground-truth completion (dashed) from Mani Reward Bench, computed from annotated subtask boundaries. We also overlay Gemini-GVL (blue) on the same episodes when available.
Figure 5: Illustrative example of the VOC failure mode. Because VOC depends only on the rank order of predicted values (not the absolute completion level), trajectories that rise and then plateau at different final completion levels can all score highly ($\geq 0.85$). As a result, VOC may not distinguish a well-ordered but incomplete (early-plateau) trajectory from a complete trajectory.
...and 12 more figures
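
The VOC failure mode described in the Figure 5 caption can be reproduced with a short sketch. It assumes VOC is the Spearman rank correlation between predicted values and chronological frame order, which matches the caption's statement that VOC depends only on rank order. The helper names, the tie handling (average ranks), the zero return for constant predictions, and the example trajectories are illustrative assumptions, not the authors' implementation.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def voc(predicted):
    """Spearman correlation between predicted values and frame order (assumed VOC)."""
    n = len(predicted)
    time_ranks = [float(t + 1) for t in range(n)]
    value_ranks = average_ranks(predicted)
    mean_t = sum(time_ranks) / n
    mean_v = sum(value_ranks) / n
    cov = sum((a - mean_t) * (b - mean_v) for a, b in zip(time_ranks, value_ranks))
    var_t = sum((a - mean_t) ** 2 for a in time_ranks)
    var_v = sum((b - mean_v) ** 2 for b in value_ranks)
    if var_t == 0 or var_v == 0:
        return 0.0  # assumption: treat constant predictions as uncorrelated
    return cov / math.sqrt(var_t * var_v)


# Hypothetical trajectories (made-up numbers for illustration).
complete = [0.2, 0.4, 0.6, 0.8, 1.0]       # rises all the way to completion
early_plateau = [0.1, 0.2, 0.3, 0.3, 0.3]  # stalls at 30% completion
lower_plateau = [0.05, 0.1, 0.15, 0.15, 0.15]  # same shape, half the level

voc_complete = voc(complete)
voc_early = voc(early_plateau)
voc_lower = voc(lower_plateau)
```

The two plateau trajectories have identical ranks, so they receive identical VOC scores even though they stop at different completion levels, and both remain above the 0.85 threshold mentioned in the caption. This is why a rank-based metric alone cannot separate a well-ordered but incomplete trajectory from a complete one.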
