Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning

Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su

TL;DR

The paper addresses learning long-horizon robotic manipulation from sparse rewards by exploiting the multi-stage structure of these tasks. It introduces DEMO3, a model-based RL framework that learns a policy, a world model, and a dense stage-aware reward online from a small set of demonstrations. The dense stage rewards come from discriminators trained online over latent representations, and they are used both for world model learning and for planning. Across 16 sparse-reward tasks in 4 domains, the method improves data-efficiency by 40% on average and by 70% on particularly difficult tasks compared to state-of-the-art approaches, using as few as five demonstrations on humanoid visual control tasks.

Abstract

Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.

Paper Structure

This paper contains 37 sections, 6 equations, 17 figures, 10 tables, 1 algorithm.

Figures (17)

  • Figure 1: Summary of results. Final success rate ($\%$) achieved by our method and a set of strong baselines, averaged across all tasks within each of 4 domains. Average of 5 seeds. Given a handful of demonstrations, our method achieves high success rates in challenging visual manipulation tasks with sparse rewards, far exceeding previous state-of-the-art methods. See Appendix \ref{app:additional_results} for per-task results.
  • Figure 2: Task domains. We evaluate methods on $\mathbf{16}$ multi-stage image-based sparse-reward tasks spanning four domains: Meta-World \cite{yu2021metaworldbenchmarkevaluationmultitask}, Robosuite \cite{zhu2022robosuitemodularsimulationframework}, as well as manipulation and humanoid tasks from ManiSkill3 \cite{tao2024maniskill3gpuparallelizedrobotics}. See Appendix \ref{app:environments} for a complete overview of tasks.
  • Figure 3: Method overview. We present a two-phase framework for multi-stage visual manipulation from sparse rewards that leverages a handful of demonstrations for dense reward learning and MBRL. Phase 1 (left): the policy and encoder are pre-trained on the available demonstrations using behavioral cloning and serve as initialization for the next phase. Phase 2 (right): the agent iteratively collects environment data via planning and uses all available data to update its world model as well as a latent state discriminator; this discriminator is used to transform sparse environment rewards into a learned dense reward for world model learning and subsequent planning.
  • Figure 4: Dense reward learning. At each update step, the continuous output of a stage discriminator is added to the environment sparse reward. The discriminator output is normalized to the $[-\beta, \beta]$ interval with a $\tanh$ operator.
  • Figure 5: Learning curves. Success rate as a function of interaction steps for each of the four domains that we consider, averaged across all tasks and 5 random seeds. The shaded area corresponds to a $95\%$ confidence interval. Our method consistently outperforms baselines.
  • ...and 12 more figures
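
The dense reward described in the Figure 4 caption can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `dense_stage_reward`, the interpretation of the discriminator output as an unbounded real-valued score, the scaling form $\beta \tanh(\cdot)$, and the default value of `beta` are all hypothetical choices made for the example. The caption itself states only that the output is normalized to $[-\beta, \beta]$ with a $\tanh$ operator and added to the sparse reward.

```python
import math


def dense_stage_reward(sparse_reward: float, disc_score: float, beta: float = 0.5) -> float:
    """Combine a sparse environment reward with a bounded discriminator term.

    Assumptions (not taken from the paper): disc_score is an unbounded
    real-valued discriminator output, and the bounding is done as
    beta * tanh(disc_score). The value of beta is a placeholder.
    """
    # tanh maps any real score into [-1, 1]; scaling by beta gives [-beta, beta].
    shaping_term = beta * math.tanh(disc_score)
    # The bounded term is added to the sparse reward from the environment.
    return sparse_reward + shaping_term


if __name__ == "__main__":
    # A neutral score leaves the sparse reward unchanged; extreme scores
    # shift it by at most beta in either direction.
    for score in (-5.0, 0.0, 5.0):
        print(score, dense_stage_reward(sparse_reward=1.0, disc_score=score))
```

Because the added term is bounded by `beta`, the learned signal can give feedback between sparse reward events without changing the total reward by more than `beta` at any step.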