From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces mean absolute error by 50% relative to specialized reasoning baselines and shows significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

Paper Structure (28 sections, 5 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Paradigm comparison: Prior approaches vs. our PRIMO R1.
  • Figure 2: Overall framework of PRIMO R1. Evaluated across in-domain simulations (AgiBot, BEHAVIOR, RoboTwin) and OOD real humanoid environments, the model processes a video sequence ($V_{seq}$) anchored by initial ($I_{init}$) and current ($I_{curr}$) states. It generates an explicit Chain-of-Thought to output the final progress estimate.
  • Figure 3: Dataset distribution for SFT (left), RL (middle), and PRIMO Bench (right). Charts show sample counts and domain percentages (e.g., BEHAVIOR, AgiBot, RoboTwin). The PRIMO Bench highlights the data split between In-Domain and Out-of-Domain evaluation sets. See Appendix \ref{sec:appendix_datasets} for details.
  • Figure 4: Fine-Grained Error Analysis Across Task Progress Intervals. MAE evaluation across five completion stages in four environments ((a)-(d)). Compared to baselines, our RL-finetuned model (Ours_SFT_RL) maintains lower error rates, particularly mitigating severe hallucinations in the final execution stage ($80{-}100\%$).
  • Figure 5: Continuous Progress Estimation. Average predicted progress trajectory over 105 episodes for the "Clear the countertop waste" task, comparing temporal state alignment against baselines.
  • ...and 5 more figures
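
The abstract and Figure 4 report progress-estimation quality as mean absolute error (MAE), both overall and per task-progress interval. As a minimal sketch of how such numbers could be computed, the Python below assumes progress is expressed in percent (0 to 100), that samples are binned by their ground-truth progress into five 20-point intervals, and that a relative reduction is measured against a baseline MAE. The function names (`mae`, `mae_by_interval`, `relative_reduction`) and the sample values are hypothetical illustrations, not the paper's evaluation code or results.

```python
from typing import Dict, List, Sequence


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth progress (percent)."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mae_by_interval(
    pred: Sequence[float],
    true: Sequence[float],
    edges: Sequence[int] = (0, 20, 40, 60, 80, 100),
) -> Dict[str, float]:
    """MAE per progress interval; samples are binned by ground-truth progress.

    Assumption: intervals are half-open [lo, hi), except the last, which
    also includes its upper edge so that 100% falls into '80-100%'.
    """
    buckets: Dict[str, List[float]] = {}
    for p, t in zip(pred, true):
        for lo, hi in zip(edges[:-1], edges[1:]):
            is_last = hi == edges[-1]
            if lo <= t < hi or (is_last and t == hi):
                buckets.setdefault(f"{lo}-{hi}%", []).append(abs(p - t))
                break
    return {name: sum(errs) / len(errs) for name, errs in buckets.items()}


def relative_reduction(baseline_mae: float, model_mae: float) -> float:
    """Fraction by which the model's MAE is lower than the baseline's."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - model_mae) / baseline_mae


if __name__ == "__main__":
    # Illustrative values only: the last sample underestimates a finished task.
    predicted = [10.0, 30.0, 50.0, 70.0, 60.0]
    ground_truth = [5.0, 35.0, 50.0, 75.0, 100.0]
    print(mae(predicted, ground_truth))
    print(mae_by_interval(predicted, ground_truth))
    print(relative_reduction(22.0, mae(predicted, ground_truth)))
```

Binning by ground-truth progress rather than by predicted progress keeps each interval's sample set identical across models, which makes per-stage comparisons like those in Figure 4 easier to read; whether the paper bins this way is an assumption here.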