MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

Patrick Lancaster, Nicklas Hansen, Aravind Rajeswaran, Vikash Kumar

TL;DR

MoDem-V2 addresses safety and sample-efficiency challenges in learning visuo-motor manipulation directly on real robots from raw visual inputs with sparse rewards. It extends MoDem by introducing policy-centered action sampling, agency transfer from BC to MPC, and actor-critic ensembles to enable uncertainty-aware, conservative planning in a latent space defined by a learned world model $(h_{\theta}, d_{\theta}, R_{\theta}, Q_{\theta})$ with policy $\pi_{\theta}$. In simulation and on a real Panda-based platform, MoDem-V2 achieves comparable or better sample efficiency than baselines while significantly reducing safety violations, enabling four tasks including pushing, bin-picking, and in-hand manipulation from ten demonstrations. By delivering practical safety-first strategies and open-source resources, it advances demonstration-augmented visual MBRL for real-world robotics.
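The three ingredients named above can be illustrated with a small sketch. It is not the authors' implementation: the summary gives no formulas, so the lower-confidence-bound value estimate, the Gaussian sampling around the policy action, and the linear handover schedule are all assumptions, as are the function names and the parameters `penalty`, `std`, and `handover_steps`.

```python
import numpy as np


def conservative_value(q_values: np.ndarray, penalty: float = 1.0) -> np.ndarray:
    """Uncertainty-aware value estimate from a critic ensemble.

    q_values has shape (ensemble_size, num_candidates). Assumed form:
    ensemble mean minus `penalty` times the ensemble standard deviation,
    so candidates the critics disagree on are scored lower.
    """
    return q_values.mean(axis=0) - penalty * q_values.std(axis=0)


def sample_policy_centered(
    policy_action: np.ndarray,
    std: float,
    num_samples: int,
    rng: np.random.Generator,
    low: float = -1.0,
    high: float = 1.0,
) -> np.ndarray:
    """Draw candidate actions around the policy's action (assumed Gaussian).

    Centering the samples on the policy output, rather than on zero,
    keeps exploration close to behavior the policy already proposes.
    """
    noise = rng.normal(0.0, std, size=(num_samples, policy_action.shape[0]))
    return np.clip(policy_action + noise, low, high)


def mpc_probability(step: int, handover_steps: int) -> float:
    """Hypothetical linear schedule for handing agency from BC to MPC.

    Returns the probability that the planner (MPC) chooses the action at
    this interaction step; the BC policy acts otherwise.
    """
    return float(np.clip(step / handover_steps, 0.0, 1.0))
```

In a planner, `sample_policy_centered` would supply candidate actions, `conservative_value` would rank them using the critic ensemble, and `mpc_probability` would decide how often the planner's choice is executed instead of the behavior-cloned policy's.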

Abstract

Robotic systems that aspire to operate in uninstrumented real-world environments must perceive the world directly via onboard sensing. Vision-based learning systems aim to eliminate the need for environment instrumentation by building an implicit understanding of the world based on raw pixels, but navigating the contact-rich high-dimensional search space from solely sparse visual reward signals significantly exacerbates the challenge of exploration. The applicability of such systems is thus typically restricted to simulated or heavily engineered environments since agent exploration in the real-world without the guidance of explicit state estimation and dense rewards can lead to unsafe behavior and safety faults that are catastrophic. In this study, we isolate the root causes behind these limitations to develop a system, called MoDem-V2, capable of learning contact-rich manipulation directly in the uninstrumented real world. Building on the latest algorithmic advancements in model-based reinforcement learning (MBRL), demo-bootstrapping, and effective exploration, MoDem-V2 can acquire contact-rich dexterous manipulation skills directly in the real world. We identify key ingredients for leveraging demonstrations in model learning while respecting real-world safety considerations -- exploration centering, agency handover, and actor-critic ensembles. We empirically demonstrate the contribution of these ingredients in four complex visuo-motor manipulation problems in both simulation and the real world. To the best of our knowledge, our work presents the first successful system for demonstration-augmented visual MBRL trained directly in the real world. Visit https://sites.google.com/view/modem-v2 for videos and more details.

Paper Structure (20 sections, 5 equations, 9 figures, 1 table, 1 algorithm)

Figures (9)

  • Figure 1: We use MoDem-V2 to train the robot on four contact-rich manipulation tasks. These tasks cover a wide range of manipulation skills, namely non-prehensile pushing, object picking, and in-hand manipulation. In recognition of the difficulty of robust pose tracking and dense reward specification in the real world, the robot performs these tasks using only raw visual feedback, proprioceptive signals, and sparse rewards.
  • Figure 2: Agent performance on the inclined pushing task before failure due to safety violations. An asterisk indicates that agent training was terminated due to significant safety violations. Left: On a real-world robot, MoDem violates the manufacturer-specified torque limits at the onset of online interaction and is unable to learn, whereas MoDem-V2's conservative exploration allows it to perfect the task. Right: Further evaluation in simulation reveals that simply penalizing the amount of torque exerted by the robot does not prevent termination due to significant safety violations. Other baseline agents are either terminated due to unsafe behavior or achieve significantly lower success than MoDem-V2.
  • Figure 3: A view of the in-hand reorientation task as an example of our hardware setup.
  • Figure 4: The number of safety violations, as defined in the paper's safety section (top row), and success rate (bottom row) for each of the four manipulation tasks in simulation. Lower is better for safety violations while higher is better for episode success. While both MoDem-V2 and MoDem achieve similar or better sample-efficiency than all of the baselines, MoDem-V2 exhibits significantly safer learning, as evidenced by the drastically lower number of safety violations.
  • Figure 5: Ablations of the three MoDem-V2 enhancements for all four tasks. Lower is better for safety violations (top row) while higher is better for episode success (bottom row). MoDem-V2 achieves both the higher sample-efficiency of Ensemble and the increased safety profile of Schedule.
  • ...and 4 more figures