Guided Reinforcement Learning for Robust Multi-Contact Loco-Manipulation

Jean-Pierre Sleiman, Mayank Mittal, Marco Hutter

TL;DR

This work proposes a systematic approach to behavior synthesis and control for multi-contact loco-manipulation tasks, such as navigating spring-loaded doors and manipulating heavy dishwashers. It defines a task-independent MDP and trains RL policies from only a single demonstration per task, generated by a model-based trajectory optimizer.

Abstract

Reinforcement learning (RL) often necessitates a meticulous Markov Decision Process (MDP) design tailored to each task. This work aims to address this challenge by proposing a systematic approach to behavior synthesis and control for multi-contact loco-manipulation tasks, such as navigating spring-loaded doors and manipulating heavy dishwashers. We define a task-independent MDP to train RL policies using only a single demonstration per task generated from a model-based trajectory optimizer. Our approach incorporates an adaptive phase dynamics formulation to robustly track the demonstrations while accommodating dynamic uncertainties and external disturbances. We compare our method against prior motion imitation RL works and show that the learned policies achieve higher success rates across all considered tasks. These policies learn recovery maneuvers that are not present in the demonstration, such as re-grasping objects during execution or dealing with slippages. Finally, we successfully transfer the policies to a real robot, demonstrating the practical viability of our approach.

Paper Structure

This paper contains 26 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: A framework for learning loco-manipulation tasks, such as traversing a spring-loaded door and manipulating dishwashers. A single demonstration guides the RL training process to learn multi-contact behaviors (such as using the feet or the arm for interaction) without task-specific handcrafted rewards.
  • Figure 2: The loco-manipulation planner [VersatileMulticontact] generates references in the form of multi-modal plans consisting of continuous trajectories $\mathbf{X}^*$ and manipulation schedules $\mathbf{M}^*$. These are used by the demonstration-guided controller to select $\mathbf{x}^*$ and $\mathbf{m}^*$ adaptively based on the task phase $\phi$ and track them robustly. The controller receives full-state feedback and sends joint position commands to the robot.
  • Figure 3: Manipulation schedules in the generated loco-manipulation demonstrations. For the quadrupedal mobile manipulator, ALMA, the end-effectors are: ARM: tool attached to a 6-DoF robotic arm, LF: left front foot, RF: right front foot, LH: left hind foot, and RH: right hind foot.
  • Figure 4: Trajectories for the door-pulling task involving a loss of contact (indicated by shaded regions). The top and bottom rows show the door and task phase trajectories.
  • Figure 5: Effect of randomization for the door-pulling task. Grey: The domain randomization (DR) region used during training. Red: The nominal parameter.
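
The phase-based reference selection described in the Figure 2 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `select_reference` and `step_phase`, the normalization of $\phi$ to $[0, 1]$, the uniform knot spacing, and the `rate_scale` parameter are hypothetical and are not taken from the paper. In particular, the paper's adaptive phase dynamics are not specified in this summary, so the update below is only a stand-in that lets the phase run at, below, or against the nominal playback rate.

```python
import numpy as np

def select_reference(X, M, phi):
    """Return (x*, m*) at phase phi from a demonstration.

    X: (N, d) array of continuous reference states at N uniformly spaced knots.
    M: (N, k) array of discrete contact flags per end-effector at the same knots.
    phi: task phase, assumed normalized to [0, 1] (assumption, not from the paper).
    """
    phi = float(np.clip(phi, 0.0, 1.0))
    n = X.shape[0]
    s = phi * (n - 1)              # fractional knot index
    i = int(np.floor(s))
    j = min(i + 1, n - 1)
    w = s - i
    x_ref = (1.0 - w) * X[i] + w * X[j]   # interpolate continuous states
    m_ref = M[i]                          # contact flags are discrete: hold, never blend
    return x_ref, m_ref

def step_phase(phi, dt, duration, rate_scale=1.0):
    """Advance the phase by one control step (hypothetical stand-in update).

    rate_scale = 1 replays the demonstration at nominal speed, 0 pauses it,
    and a negative value rewinds it, e.g. to retry after a lost contact.
    """
    return float(np.clip(phi + rate_scale * dt / duration, 0.0, 1.0))

# Toy demonstration: 3 knots, 2 state dimensions, 2 end-effectors.
X_demo = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
M_demo = np.array([[0, 1], [1, 1], [1, 0]])

x_ref, m_ref = select_reference(X_demo, M_demo, 0.25)
phi_next = step_phase(0.5, dt=0.1, duration=2.0)
```

Indexing the references by a phase variable rather than by wall-clock time is what allows the tracked target to slow down or back up when the robot falls behind the demonstration; how the rate is chosen is left open in this sketch.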