ARM: Advantage Reward Modeling for Long-Horizon Manipulation

Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen

Abstract

Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors like backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from estimating hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. Trained on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
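To make the tri-state signal concrete, here is a minimal sketch, assuming a unit progress step per labeled segment, of how such labels could be integrated into the global progress curve that ARM reconstructs. The names TriState and reconstruct_progress (and the step size) are illustrative assumptions of ours, not the paper's implementation.

```python
from enum import IntEnum

# Illustrative only: class/function names and the unit step size are
# our assumptions, not the paper's implementation.
class TriState(IntEnum):
    REGRESSIVE = -1   # segment moves away from task completion
    STAGNANT = 0      # no net change in progress
    PROGRESSIVE = 1   # segment moves toward task completion

def reconstruct_progress(labels, step=1.0):
    """Integrate per-segment tri-state labels into a global progress
    curve, normalized so the episode's peak progress equals 1.0."""
    progress, curve = 0.0, []
    for label in labels:
        progress += step * int(label)
        curve.append(progress)
    peak = max(curve) if curve and max(curve) > 0 else 1.0
    return [p / peak for p in curve]

# A non-monotonic episode: progress, a regressive adjustment (recovery),
# then progress again -- exactly the case where monotonic progress
# labels break down.
labels = [TriState.PROGRESSIVE] * 3 + [TriState.REGRESSIVE] + [TriState.PROGRESSIVE] * 4
print(reconstruct_progress(labels))
```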

Paper Structure

This paper contains 40 sections, 6 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Overview of our proposed framework. The system consists of three main components: (1) The Advantage Reward Model (ARM) with its MIMO-based Temporal Transformer, supervised by a lightweight tri-state labeling strategy; (2) An automated pipeline for global progress reconstruction; and (3) The Advantage-Weighted Behavior Cloning (AW-BC) algorithm, which optimizes the policy using length-invariant relative gains extracted from the reconstructed progress (see the AW-BC sketch after this figure list).
  • Figure 2: Comparison between MISO and MIMO architectures. MISO stands for Multi-Input Single-Output, and MIMO stands for Multi-Input Multi-Output.
  • Figure 3: Illustration of the tri-state labeling strategy applied to a demonstration episode.
  • Figure 4: Overview of the long-horizon towel-folding task. The sequence includes extracting a towel from clutter, placing and flattening it on the table, executing a precise multi-stage folding strategy, and transporting the folded towel into the target box.
  • Figure 5: Qualitative comparison of progress reconstruction. We visualize the progress curves of SARM and ARM against the Ground Truth (GT) for a representative episode. While SARM struggles with non-monotonic behaviors, ARM reconstructs a smooth, high-fidelity curve that closely tracks the GT, even during regressive adjustments.
  • ...and 3 more figures
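Figure 1's AW-BC component optimizes the policy using relative gains extracted from the reconstructed progress. The paper's exact weighting scheme is not given in this overview, so the sketch below uses a generic exponentiated-advantage weight in the style of advantage-weighted regression; aw_bc_loss, beta, and w_max are our own hypothetical names and defaults.

```python
import torch
import torch.nn.functional as F

def aw_bc_loss(pred_actions, demo_actions, advantages, beta=1.0, w_max=10.0):
    """Behavior-cloning loss reweighted by exponentiated advantage.

    `beta` (temperature) and `w_max` (weight clamp) are hypothetical
    hyperparameters of this sketch, not values from the paper.
    Samples with positive relative gain are up-weighted; regressive or
    stagnant segments are down-weighted, softly filtering suboptimal
    data without discarding it outright.
    """
    weights = torch.exp(advantages / beta).clamp(max=w_max).detach()
    per_sample = F.mse_loss(pred_actions, demo_actions, reduction="none").mean(dim=-1)
    return (weights * per_sample).mean()

# Usage with dummy tensors: batch of 4 samples, 7-dim action space.
pred = torch.randn(4, 7, requires_grad=True)
demo = torch.randn(4, 7)
adv = torch.tensor([0.5, -0.2, 0.0, 1.3])  # ARM-style relative gains
loss = aw_bc_loss(pred, demo, adv)
loss.backward()
```

Clamping the exponentiated weight is a common stabilizer in advantage-weighted objectives: it keeps a handful of high-advantage samples from dominating a batch.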