RL-Augmented MPC for Non-Gaited Legged and Hybrid Locomotion

Andrea Patrizi, Carlo Rizzardo, Arturo Laurenzi, Francesco Ruscelli, Luca Rossini, Nikos G. Tsagarakis

TL;DR

The proposed contact-explicit hierarchical architecture, which couples Reinforcement Learning and Model Predictive Control, achieves zero-shot sim-to-sim transfer across all platforms without domain randomization. Only a minimal set of rewards and limited tuning are required to obtain effective policies.

Abstract

We propose a contact-explicit hierarchical architecture coupling Reinforcement Learning (RL) and Model Predictive Control (MPC), where a high-level RL agent provides gait and navigation commands to a low-level locomotion MPC. This offloads the combinatorial burden of contact timing from the MPC by learning acyclic gaits through trial and error in simulation. We show that only a minimal set of rewards and limited tuning are required to obtain effective policies. We validate the architecture in simulation across robotic platforms spanning 50 kg to 120 kg and different MPC implementations, observing the emergence of acyclic gaits and timing adaptations in flat-terrain legged and hybrid locomotion, and further demonstrating extensibility to non-flat terrains. Across all platforms, we achieve zero-shot sim-to-sim transfer without domain randomization, and we further demonstrate zero-shot sim-to-real transfer without domain randomization on Centauro, our 120 kg wheeled-legged humanoid robot. We make our software framework and evaluation results publicly available at https://github.com/AndrePatri/AugMPC.

Paper Structure

This paper contains 34 sections, 5 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Our hierarchical RL-MPC architecture has been trained and evaluated on several robotic platforms with different morphologies and weight distributions ($50~\mathrm{kg}$ to $120~\mathrm{kg}$), successfully transferred zero-shot across different domains without any domain randomization, and applied to both hybrid and standard locomotion tasks. Supplementary videos are available at \videoURL.
  • Figure 2: The proposed hierarchical architecture uses an RL policy to generate both contact schedules and navigation commands for the underlying MPC.
  • Figure 3: High-level software architecture. The system is modular, with three main components: world interface, MPC cluster and training environment. Green modules operate on GPU-resident data; orange modules run on CPU.
  • Figure 4: Task tracking episode sub-reward (first row) and CoT metric (second row) during training for all the flat-terrain case studies, averaged across 800 environments. Dark shaded areas indicate the first and third quartiles of the distribution across environments; lighter regions denote the 5th and 95th percentiles. Action rate sub-rewards are omitted for brevity.
  • Figure 5: Deterministic evaluation of a legged locomotion policy for Centauro, shown over a $50\,\mathrm{s}$ window. The injection-request actions $\bm\chi_{\mathrm{MPC}}$ chosen by the policy (b) generate the contact schedule and motion (a), revealing fully acyclic contact patterns and timing adaptations. The associated tracking performance is shown in (c). A flight phase is injected within the MPC horizon for foot $j$ only when $\chi_{\mathrm{MPC}}^{(j)} < 0$.
  • ...and 3 more figures
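
The injection rule stated in the Figure 5 caption (a flight phase is requested for foot $j$ only when $\chi_{\mathrm{MPC}}^{(j)} < 0$) can be illustrated with a minimal Python sketch. Only the sign test comes from the caption. The boolean schedule layout, the function names, and the `start_node` and `n_flight_nodes` placement parameters are hypothetical choices made for illustration and are not taken from the AugMPC code base.

```python
from typing import List, Sequence


def feet_requesting_flight(chi_mpc: Sequence[float]) -> List[int]:
    """Indices j of the feet whose action component satisfies chi_mpc[j] < 0."""
    return [j for j, chi in enumerate(chi_mpc) if chi < 0.0]


def inject_flight_phases(
    schedule: List[List[bool]],
    chi_mpc: Sequence[float],
    start_node: int,
    n_flight_nodes: int,
) -> List[List[bool]]:
    """Return a copy of the per-foot contact schedule with flight phases added.

    schedule[j][k] is True when foot j is planned in contact at horizon node k.
    start_node and n_flight_nodes are hypothetical placement parameters: the
    caption does not say where in the horizon, or for how long, a flight
    phase is placed.
    """
    updated = [list(row) for row in schedule]  # leave the input untouched
    for j in feet_requesting_flight(chi_mpc):
        end = min(start_node + n_flight_nodes, len(updated[j]))
        for k in range(start_node, end):
            updated[j][k] = False  # foot j is in the air at node k
    return updated


# Four feet, eight-node horizon, all feet initially in contact.
horizon = [[True] * 8 for _ in range(4)]
chi = [0.4, -0.2, 0.0, -0.9]  # only feet 1 and 3 have a negative component
new_horizon = inject_flight_phases(horizon, chi, start_node=2, n_flight_nodes=3)
```

Because each foot is tested independently, nothing in this rule forces a periodic pattern: a policy can lift any subset of feet at any step, which is consistent with the acyclic contact patterns the caption reports.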