SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu

TL;DR

This work tackles long-horizon, contact-rich manipulation of deformable objects by introducing Stage-Aware Reward Modeling (SARM), a video-based framework that derives progress signals from natural language subtasks. SARM uses a dual-head architecture to predict high-level stages and fine-grained subtask progress, enabling stable rewards across variable-length demonstrations. Coupled with Reward-Aligned Behavior Cloning (RA-BC), which reweights and filters data based on progress signals, the approach achieves substantial performance gains on T-shirt folding and demonstrates robustness to dataset diversity and real-world rollouts. The results highlight reward modeling as a key enabler for scalable, annotation-efficient imitation learning in challenging long-horizon manipulation tasks, with significant practical impact for domestic robotics and beyond.

Abstract

Large-scale robot learning has recently shown promise for enabling robots to perform complex tasks by integrating perception, control, and language understanding. Yet, it struggles with long-horizon, contact-rich manipulation such as deformable object handling, where demonstration quality is inconsistent. Reward modeling offers a natural solution: by providing grounded progress signals, it transforms noisy demonstrations into stable supervision that generalizes across diverse trajectories. We introduce a stage-aware, video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress. Reward labels are automatically derived from natural language subtask annotations, ensuring consistent progress estimation across variable-length demonstrations. This design overcomes frame-index labeling, which fails in variable-duration tasks like folding a T-shirt. Our reward model demonstrates robustness to variability, generalization to out-of-distribution settings, and strong utility for policy training. Building on it, we propose Reward-Aligned Behavior Cloning (RA-BC), which filters high-quality data and reweights samples by reward. Experiments show the reward model alone outperforms baselines on validation and real robot rollouts. Integrated into RA-BC, our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state -- far surpassing vanilla behavior cloning, which attains only 8% and 0% success. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon manipulation.
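
To make the contrast between frame-index labeling and stage-aware labeling concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each demonstration comes with the end frame of every annotated subtask and a fixed fraction of the overall task assigned to each subtask. The function names, the linear within-stage interpolation, and the weighting rule in `rabc_weights` are illustrative assumptions; the abstract states only that RA-BC filters data and reweights samples by reward.

```python
from typing import List, Sequence


def frame_index_labels(num_frames: int) -> List[float]:
    """Baseline: progress is the normalized frame index, ignoring subtasks."""
    if num_frames == 1:
        return [1.0]
    return [t / (num_frames - 1) for t in range(num_frames)]


def stage_aware_labels(stage_ends: Sequence[int],
                       stage_fractions: Sequence[float]) -> List[float]:
    """Progress labels anchored to subtask boundaries (hypothetical scheme).

    stage_ends[k] is the exclusive end frame of subtask k, and
    stage_fractions[k] is the share of the whole task assigned to it
    (assumed to sum to 1). Progress grows linearly inside each subtask.
    """
    labels: List[float] = []
    start, done = 0, 0.0
    for end, frac in zip(stage_ends, stage_fractions):
        length = end - start
        if length <= 0:
            raise ValueError("stage_ends must be strictly increasing")
        for i in range(length):
            labels.append(done + frac * (i + 1) / length)
        start, done = end, done + frac
    return labels


def rabc_weights(progress_gain: Sequence[float],
                 min_gain: float = 0.0) -> List[float]:
    """Assumed reweighting rule: samples whose predicted progress gain is
    at or below min_gain get weight 0 (filtering); the rest are weighted
    in proportion to their gain and rescaled to have mean weight 1."""
    kept = [max(g - min_gain, 0.0) for g in progress_gain]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(kept)
    scale = len(kept) / total
    return [k * scale for k in kept]


# Two demonstrations of the same two-subtask task; the second one spends
# three times as long on the first subtask.
fast = stage_aware_labels([2, 4], [0.5, 0.5])
slow = stage_aware_labels([6, 8], [0.5, 0.5])
weights = rabc_weights([0.2, -0.1, 0.0, 0.6])
```

Under these assumptions both demonstrations receive the same label at the end of the first subtask (`fast[1]` and `slow[5]`), whereas the frame-index baseline assigns that same physical state two different values because the demonstrations differ in length. In `weights`, the samples with non-positive gain are dropped and the remaining ones are scaled by how much progress they make.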

Paper Structure

This paper contains 43 sections, 11 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: Overview of our method's framework for (a) data processing, (b) reward model training, and (c) policy training with reward signals. $\mathcal{D}_{\text{anno}}$ denotes the annotated dataset used for training the reward model, with examples shown in Fig. \ref{fig:demo_sparse} and Fig. \ref{fig:demo_dense}. $\mathcal{D}_{\text{diverse}}$ refers to a diverse expert dataset without annotations, which contains many suboptimal trajectories.
  • Figure 2: Overview of SARM (stage-aware reward modeling). Left: SARM comprises a stage estimator and a subtask estimator. First, the task stage is predicted from the observations. This prediction is then passed to the subtask estimator, which predicts a scalar value for the progress within the stage. Right: The estimator architecture, which is replicated for both the stage estimator and the subtask estimator.
  • Figure 3: A visualization of the predicted task progress for T-shirt folding demonstrations. Compared with ReWiND, SARM provides more accurate and calibrated estimates.
  • Figure 4: The physical station used for data collection and policy evaluation.
  • Figure 5: Expert demonstration with sparse annotation.
  • ...and 13 more figures