RL + Model-based Control: Using On-demand Optimal Control to Learn Versatile Legged Locomotion

Dongho Kang; Jin Cheng; Miguel Zamora; Fatemeh Zargarbashi; Stelian Coros

TL;DR

This paper addresses versatile, robust legged locomotion across gaits, velocities, and terrains by combining model-based optimal control (MBOC) with reinforcement learning (RL). Reference motions are generated on demand by solving a finite-horizon optimal control problem (OCP) on a Variable Height Inverted Pendulum Model (VHIPM); a deep RL policy is then trained to imitate the reference base and foot trajectories. The key contributions are on-demand reference motion generation during training, a single RL policy that produces diverse gait patterns without robot-specific reward tuning, and hardware validation on the Unitree Go1 and Aliengo demonstrating successful sim-to-real transfer. The approach offers a scalable, data-efficient route to robust legged control across multiple quadruped platforms, reducing hand-engineering while preserving dynamic capabilities.

Abstract

This paper presents a control framework that combines model-based optimal control and reinforcement learning (RL) to achieve versatile and robust legged locomotion. Our approach enhances the RL training process by incorporating on-demand reference motions generated through finite-horizon optimal control, covering a broad range of velocities and gaits. These reference motions serve as targets for the RL policy to imitate, leading to the development of robust control policies that can be learned with reliability. Furthermore, by utilizing realistic simulation data that captures whole-body dynamics, RL effectively overcomes the inherent limitations in reference motions imposed by modeling simplifications. We validate the robustness and controllability of the RL training process within our framework through a series of experiments. In these experiments, our method showcases its capability to generalize reference motions and effectively handle more complex locomotion tasks that may pose challenges for the simplified model, thanks to RL's flexibility. Additionally, our framework effortlessly supports the training of control policies for robots with diverse dimensions, eliminating the necessity for robot-specific adjustments in the reward function and hyperparameters.
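As an illustration of how reference motions can serve as imitation targets, the sketch below scores how closely a robot state tracks a reference. It assumes an exponential tracking kernel, which is common in motion-imitation RL but is not confirmed here as the paper's reward; the function names, weights, and the `sensitivity` parameter are hypothetical.

```python
import numpy as np


def tracking_kernel(value, reference, sensitivity=10.0):
    """Map a tracking error to (0, 1]; 1.0 means perfect tracking.

    The exponential form and `sensitivity` are assumptions for illustration.
    """
    error = np.asarray(value, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.exp(-sensitivity * np.sum(error ** 2)))


def imitation_reward(base_state, base_ref, feet_pos, feet_ref,
                     w_base=0.5, w_feet=0.5):
    """Weighted sum of base and foot tracking terms (hypothetical weights)."""
    base_term = tracking_kernel(base_state, base_ref)
    feet_term = tracking_kernel(feet_pos, feet_ref)
    return w_base * base_term + w_feet * feet_term


if __name__ == "__main__":
    base_ref = np.array([0.5, 0.0, 0.30])   # e.g. forward velocity, lateral velocity, height
    feet_ref = np.zeros((4, 3))             # one 3D target per foot
    print(imitation_reward(base_ref, base_ref, feet_ref, feet_ref))
    print(imitation_reward(base_ref + 0.1, base_ref, feet_ref, feet_ref))
```

With this shape, the reward is bounded and decreases smoothly as the state drifts from the reference, which is one reason such kernels are popular for imitation objectives.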

Paper Structure (15 sections, 7 equations, 7 figures, 3 tables)

Introduction
Related Work
Model-based Optimal Control for Legged Locomotion
RL-based Legged Locomotion
Overview
Reference Motion Synthesis
The Variable Height Inverted Pendulum Model
Finite-horizon Optimal Control
Motion Imitation with Deep RL
Observation and Action Space
Reward Definition
Results
Simulation Experiments
Hardware Experiments
Conclusion and Future Work

Figures (7)

Figure 1: Snapshots of the quadruped robots Unitree Go1 (top) and Unitree Aliengo (bottom) performing various locomotion tasks.
Figure 2: Overview of our framework. The objective is to train a policy that outputs joint actions to imitate the reference. The reward signal quantifies the similarity between the robot's state and the reference motions generated by the model-based optimal control motion generator.
Figure 3: Quadrupedal robot Unitree Go1 represented as a variable-height inverted pendulum (left) and a diagram of its xz-plane projection (right). The VHIPM expresses a robot's CoM acceleration as a function of its CoM position, desired vertical acceleration, and CoP position.
Figure 4: To compare the behavior of MPC (blue), Baseline (purple), and Ours (yellow) in more detail, we plot the forward velocity, base height, and front-left foot trajectory at a commanded velocity of 0.5 m/s for pronk (first row) and 0.3 m/s for bound (second row). The colored lines show the mean of each quantity over five training runs with different seeds, and the shaded areas show the corresponding standard deviations. The red dotted lines indicate the commanded velocity and motion parameters.
Figure 5: Snapshot of the perturbation test (right) and the maximum pushing force that each policy withstands along different directions (left). The radial axis of the plot is in log scale.
...and 2 more figures
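
The Figure 3 caption describes the VHIPM as mapping CoM position, desired vertical acceleration, and CoP position to CoM acceleration. A minimal sketch of that mapping follows, assuming the standard variable-height inverted pendulum relation in which the horizontal acceleration is (vertical acceleration + g) / height times the horizontal CoM-to-CoP offset. The paper's exact notation and equations are not reproduced here, and the function name and argument layout are hypothetical.

```python
import numpy as np

GRAVITY = 9.81  # m/s^2


def vhipm_com_acceleration(com_pos, cop_pos, vertical_acc):
    """Return the 3D CoM acceleration under an assumed variable-height pendulum model.

    com_pos:      CoM position [x, y, z] in meters
    cop_pos:      CoP position [x, y, z] on the ground in meters
    vertical_acc: desired vertical CoM acceleration in m/s^2
    """
    com_pos = np.asarray(com_pos, dtype=float)
    cop_pos = np.asarray(cop_pos, dtype=float)
    height = com_pos[2] - cop_pos[2]
    if height <= 0.0:
        raise ValueError("CoM must be above the CoP")
    # The ground reaction force acts along the line from the CoP to the CoM,
    # so the horizontal acceleration scales with the horizontal offset.
    scale = (vertical_acc + GRAVITY) / height
    horizontal_acc = scale * (com_pos[:2] - cop_pos[:2])
    return np.array([horizontal_acc[0], horizontal_acc[1], vertical_acc])


if __name__ == "__main__":
    # CoM 0.1 m ahead of the CoP at 0.3 m height, no vertical acceleration.
    print(vhipm_com_acceleration([0.1, 0.0, 0.3], [0.0, 0.0, 0.0], 0.0))
```

Setting the vertical acceleration to zero at constant height recovers the familiar linear inverted pendulum, which is why the variable-height form is a natural extension for gaits with large height changes.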
