
PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing

Yuheng Ji, Yuyang Liu, Huajie Tan, Xuchuan Huang, Fanding Huang, Yijie Xu, Cheng Chi, Yuting Zhao, Huaihai Lyu, Peterson Co, Mingyu Cao, Qiongyu Zhang, Zhe Li, Enshen Zhou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang, Xiaolong Zheng

Abstract

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.

Paper Structure (108 sections, 23 theorems, 50 equations, 19 figures, 13 tables)

Key Result

Lemma 1.1

For any real sequence $(\Phi_t)_{t=0}^T$,
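The statement of Lemma 1.1 is cut off in this listing. Going by its title (variation dominates endpoint displacement), it presumably bounds the endpoint displacement $|\Phi_T - \Phi_0|$ by the total variation $\sum_{t=0}^{T-1} |\Phi_{t+1} - \Phi_t|$, which is the triangle inequality applied to a telescoping sum. The Python sketch below checks that reading numerically. The function names and the sample sequence are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def signed_increments(phi: Sequence[float]) -> List[float]:
    """Per-step changes Phi_{t+1} - Phi_t of a progress sequence."""
    return [b - a for a, b in zip(phi, phi[1:])]


def total_variation(phi: Sequence[float]) -> float:
    """Sum of absolute per-step changes (the path length of the sequence)."""
    return sum(abs(d) for d in signed_increments(phi))


def endpoint_displacement(phi: Sequence[float]) -> float:
    """Absolute net change between the first and last values."""
    return abs(phi[-1] - phi[0])


# Hypothetical progress estimates with one regression (0.5 -> 0.25).
phi = [0.0, 0.25, 0.5, 0.25, 0.75, 1.0]

# Signed increments telescope to the net change, whatever the path.
assert sum(signed_increments(phi)) == phi[-1] - phi[0]

# Total variation is at least the endpoint displacement.
assert total_variation(phi) >= endpoint_displacement(phi)

print(total_variation(phi), endpoint_displacement(phi))
```

Under this reading, the gap between the two quantities is nonzero only when the sequence regresses somewhere, so a monotone progress curve attains equality.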

Figures (19)

  • Figure 1: PRM-as-a-Judge and OPD. Binary success rate compresses an execution into a terminal outcome and obscures progress, efficiency, and stability. We use a Process Reward Model to induce a dense progress potential and derive OPD metrics at the outcome, process, and diagnosis levels. This yields fine-grained auditing that distinguishes near-miss failures from early collapse and separates smooth successes from inefficient or unstable ones.
  • Figure 2: RoboPulse overview. Left: data composition across collection settings and embodiment-setting categories. Right: task semantic coverage via a token word cloud extracted from task names. RoboPulse is designed to probe micro-scale progress discrimination under diverse embodiments and visual domains.
  • Figure 3: Reachability and failure-stage decomposition by milestone coverage. For each task, we plot the fraction of rollout episodes that reach milestone thresholds (25/50/75/100%), revealing where execution progress tends to stall along the horizon.
  • Figure 4: Success-only execution quality on Handover Mic. We report success-conditioned OPD metrics and compare path efficiency, measured by PPL, against regression, measured by CRA, and stagnation, measured by STR, across policy families. Error bars denote standard deviation across successful episodes.
  • Figure 5: Failure-only OPD fingerprints. We normalize OPD metrics over failed episodes within each task to highlight policy-specific failure patterns.
  • ...and 14 more figures
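
The milestone-coverage decomposition described for Figure 3 can be sketched from per-episode progress curves. This is a minimal illustration, assuming each rollout is summarized as a list of progress estimates in [0, 1] and that an episode counts as reaching a milestone when its peak progress meets the threshold. The data layout, the reaching rule, and the function name are assumptions, not details given in the caption.

```python
from typing import Dict, Sequence


def milestone_reach_rates(
    episodes: Sequence[Sequence[float]],
    thresholds: Sequence[float] = (0.25, 0.5, 0.75, 1.0),
) -> Dict[float, float]:
    """Fraction of episodes whose peak progress reaches each threshold."""
    if not episodes:
        raise ValueError("need at least one episode")
    rates = {}
    for th in thresholds:
        reached = sum(1 for ep in episodes if max(ep) >= th)
        rates[th] = reached / len(episodes)
    return rates


# Hypothetical progress curves for four rollouts of one task.
episodes = [
    [0.0, 0.25, 0.5, 1.0],   # completes the task
    [0.0, 0.25, 0.5, 0.5],   # stalls at the halfway milestone
    [0.0, 0.25, 0.25, 0.0],  # reaches 25% and then regresses
    [0.0, 0.0, 0.0, 0.0],    # never makes progress
]

for threshold, rate in milestone_reach_rates(episodes).items():
    print(f"reached {threshold:.0%}: {rate:.0%} of episodes")
```

The drop between consecutive thresholds indicates where executions tend to stall: in this toy data, one episode is lost at each stage up to 75%.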

Theorems & Definitions (50)

  • Lemma 1.1: Variation dominates endpoint displacement
  • proof
  • Lemma 1.2: Range and monotonicity
  • proof
  • Lemma 1.3: Stability under bounded judge error
  • proof
  • Lemma 1.4: Range and refinement monotonicity
  • proof
  • Theorem 1.5: Range and tight characterization
  • proof
  • ...and 40 more