SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

Qianzhong Chen; Justin Yu; Mac Schwager; Pieter Abbeel; Yide Shentu; Philipp Wu

TL;DR

This work tackles long-horizon, contact-rich manipulation of deformable objects by introducing Stage-Aware Reward Modeling (SARM), a video-based framework whose progress labels are derived from natural language subtask annotations. SARM uses a dual-head architecture to predict the high-level task stage and the fine-grained progress within it, which keeps rewards consistent across variable-length demonstrations. Coupled with Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights training data using these progress signals, the approach reaches 83% success on T-shirt folding from the flattened state and 67% from the crumpled state, compared with 8% and 0% for vanilla behavior cloning. The reward model also remains robust in out-of-distribution settings and on real robot rollouts. The results point to reward modeling as a key enabler for scalable, annotation-efficient imitation learning in long-horizon manipulation.
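
To make the filter-and-reweight idea behind RA-BC concrete, here is a minimal sketch. It assumes each training sample comes with a predicted progress gain (how much the reward model's progress estimate increases over the sample's action chunk). The names `rabc_weights` and `weighted_bc_loss`, the thresholds `min_delta` and `max_delta`, and the linear ramp are all hypothetical illustrations, not the paper's actual formulation.

```python
def rabc_weights(progress_deltas, min_delta=0.0, max_delta=0.1):
    """Map predicted progress gains to sample weights in [0, 1].

    Assumed scheme (not from the paper): samples whose gain is at or
    below `min_delta` are filtered out (weight 0); gains at or above
    `max_delta` get full weight; values in between ramp linearly.
    """
    weights = []
    for delta in progress_deltas:
        if delta <= min_delta:
            weights.append(0.0)  # filtered: no measurable progress
        else:
            ramp = (delta - min_delta) / (max_delta - min_delta)
            weights.append(min(ramp, 1.0))
    return weights


def weighted_bc_loss(per_sample_losses, weights):
    """Weighted mean of per-sample behavior cloning losses."""
    total = sum(weights)
    if total == 0:
        return 0.0  # every sample in the batch was filtered out
    weighted = sum(w * l for w, l in zip(weights, per_sample_losses))
    return weighted / total


if __name__ == "__main__":
    deltas = [-0.02, 0.05, 0.2]   # regress, modest gain, large gain
    losses = [1.0, 2.0, 4.0]      # stand-in imitation losses
    w = rabc_weights(deltas)
    print(w, weighted_bc_loss(losses, w))
```

In this sketch a sample that moves the task backwards contributes nothing to the loss, while samples that advance the task dominate the gradient. Any monotone mapping from reward to weight would serve the same illustrative purpose.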

Abstract

Large-scale robot learning has recently shown promise for enabling robots to perform complex tasks by integrating perception, control, and language understanding. Yet it struggles with long-horizon, contact-rich manipulation such as deformable object handling, where demonstration quality is inconsistent. Reward modeling offers a natural solution: by providing grounded progress signals, it transforms noisy demonstrations into stable supervision that generalizes across diverse trajectories. We introduce a stage-aware, video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress. Reward labels are automatically derived from natural language subtask annotations, ensuring consistent progress estimation across variable-length demonstrations. This design overcomes the limitations of frame-index-based labeling, which fails in variable-duration tasks like folding a T-shirt. Our reward model demonstrates robustness to variability, generalization to out-of-distribution settings, and strong utility for policy training. Building on it, we propose Reward-Aligned Behavior Cloning (RA-BC), which filters high-quality data and reweights samples by reward. Experiments show the reward model alone outperforms baselines on validation and real robot rollouts. Integrated into RA-BC, our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state -- far surpassing vanilla behavior cloning, which attains only 8% and 0% success. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon manipulation.
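
The contrast between frame-index labels and stage-aware labels can be illustrated with a small sketch. It assumes each demonstration is annotated with the frame at which every subtask starts, and that each stage is assigned a fixed cumulative progress value at its end. The function names, the `boundaries` and `stage_end_progress` parameters, and the linear interpolation within a stage are hypothetical choices made for illustration, not the paper's exact labeling rule.

```python
from bisect import bisect_right


def stage_aware_progress(frame, boundaries, stage_end_progress):
    """Return (stage_index, progress in [0, 1]) for one frame.

    boundaries: start frame of each stage followed by the total frame
        count, e.g. [0, 40, 100] for a two-stage demonstration.
    stage_end_progress: cumulative progress reached at the end of each
        stage, e.g. [0.5, 1.0]. Both are assumed annotation formats.
    """
    last_stage = len(stage_end_progress) - 1
    stage = min(bisect_right(boundaries, frame) - 1, last_stage)
    start, end = boundaries[stage], boundaries[stage + 1]
    lo = stage_end_progress[stage - 1] if stage > 0 else 0.0
    hi = stage_end_progress[stage]
    # Linear interpolation inside the stage, clamped to [0, 1].
    frac = min(max((frame - start) / max(end - start, 1), 0.0), 1.0)
    return stage, lo + frac * (hi - lo)


def frame_index_progress(frame, num_frames):
    """Baseline label: progress is simply the normalized frame index."""
    return frame / max(num_frames - 1, 1)


if __name__ == "__main__":
    ends = [0.5, 1.0]
    fast = [0, 40, 100]    # first stage finished quickly
    slow = [0, 120, 200]   # first stage took much longer
    # Same semantic moment (start of stage 2) in both demonstrations.
    print(stage_aware_progress(40, fast, ends))
    print(stage_aware_progress(120, slow, ends))
    print(frame_index_progress(40, 100), frame_index_progress(120, 200))
```

Under these assumptions, the moment the second stage begins receives the same stage-aware label in both demonstrations, whereas the frame-index baseline assigns it different values depending on how long the first stage happened to take.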
