CLF-RL: Control Lyapunov Function Guided Reinforcement Learning

Kejun Li, Zachary Olkin, Yisong Yue, Aaron D. Ames

TL;DR

This work tackles reward design challenges in reinforcement learning for bipedal locomotion by introducing a principled CLF-based reward that leverages model-based reference trajectories. The framework (CLF-RL) integrates two reference generators (the online, reduced-order H-LIP and the offline, full-order HZD gait library) to produce desirable targets and a Lyapunov-based objective $V(\boldsymbol{\tilde{y}})=\boldsymbol{\tilde{y}}^\top P \boldsymbol{\tilde{y}}$ that enforces stability through a CLF decrease condition. The method yields a final training signal $R = r_v + r_{\text{dot}v} + r_{\text{hol}} + r_{\text{reg}}$, enabling stable, robust learning and sim-to-real transfer to the 29-DoF Unitree G1 humanoid, with extensive hardware validation including outdoor tests. By combining model-based references with CLF-guided rewards, the approach reduces manual reward tuning, improves robustness to perturbations and payloads, and provides a modular path to leveraging either reduced- or full-order planners for reliable legged locomotion policies.
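
For reference, a CLF decrease condition of the kind mentioned above is commonly written in the standard exponential form below. The rate constant $\lambda$ and the requirement $P = P^\top \succ 0$ are the usual textbook assumptions; the paper's exact constants are not given in this summary.

```latex
\begin{align}
V(\boldsymbol{\tilde{y}}) &= \boldsymbol{\tilde{y}}^\top P \boldsymbol{\tilde{y}},
  \qquad P = P^\top \succ 0, \\
\dot{V}(\boldsymbol{\tilde{y}}) &\le -\lambda V(\boldsymbol{\tilde{y}}),
  \qquad \lambda > 0
  \;\Longrightarrow\;
  V(\boldsymbol{\tilde{y}}(t)) \le e^{-\lambda t}\, V(\boldsymbol{\tilde{y}}(0)), \\
\lambda_{\min}(P)\,\|\boldsymbol{\tilde{y}}\|^2 &\le V(\boldsymbol{\tilde{y}}) \le \lambda_{\max}(P)\,\|\boldsymbol{\tilde{y}}\|^2
  \;\Longrightarrow\;
  \|\boldsymbol{\tilde{y}}(t)\| \le \sqrt{\tfrac{\lambda_{\max}(P)}{\lambda_{\min}(P)}}\; e^{-\lambda t/2}\, \|\boldsymbol{\tilde{y}}(0)\|.
\end{align}
```

In words: if the decrease condition holds along a trajectory, the tracking error converges exponentially, which is the behavior the CLF-based reward is meant to encourage.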

Abstract

Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
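
To make the idea of a CLF-based reward concrete, here is a minimal Python sketch of how two such terms could be computed from a tracking error. It is an illustration under stated assumptions, not the paper's implementation: the exponential kernels, the finite-difference estimate of $\dot V$, and the parameter names `lam`, `sigma_v`, and `sigma_dv` are all hypothetical, and the holonomic and regularization terms are omitted.

```python
import math


def quad_form(P, y):
    """Return y^T P y for a square matrix P (list of rows) and a vector y."""
    n = len(y)
    return sum(y[i] * P[i][j] * y[j] for i in range(n) for j in range(n))


def clf_rewards(P, y_err, y_err_prev, dt, lam=1.0, sigma_v=1.0, sigma_dv=1.0):
    """Sketch of two CLF-shaped reward terms (assumed forms, not the paper's).

    y_err, y_err_prev: tracking error at the current and previous step.
    lam, sigma_v, sigma_dv: hypothetical decay rate and kernel widths.
    """
    v = quad_form(P, y_err)
    v_prev = quad_form(P, y_err_prev)
    # Finite-difference estimate of dV/dt (an assumption of this sketch).
    v_dot = (v - v_prev) / dt
    # Amount by which the decrease condition V_dot + lam * V <= 0 is violated.
    violation = max(0.0, v_dot + lam * v)
    r_v = math.exp(-v / sigma_v)               # rewards small tracking error
    r_dot_v = math.exp(-violation / sigma_dv)  # rewards satisfying the decrease
    return r_v, r_dot_v


if __name__ == "__main__":
    P = [[1.0, 0.0], [0.0, 1.0]]
    # Error shrinking from [2, 0] to [1, 0]: the decrease condition holds.
    print(clf_rewards(P, [1.0, 0.0], [2.0, 0.0], dt=1.0))
```

Both terms lie in $(0, 1]$ and reach 1 only when the error is zero and the decrease condition is satisfied, which gives the policy a graded signal at every step rather than only at the end of an episode.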

Paper Structure

This paper contains 11 sections, 1 theorem, 13 equations, 7 figures, 2 tables.

Key Result

Proposition 1

Consider the fully-actuated humanoid system given by Eq. eq:dynamics and satisfying the constraint given by Eq. eq:hol_dynamics and friction cone constraints. If $Q \succ 0$ in the CARE and the Jacobian of the virtual constraints, $\frac{\partial y}{\partial q}$, is invertible, then the function $V(

Figures (7)

  • Figure 1: Overview of our approach. A reference generator produces target trajectories, which are used to construct a CLF-based reward. An RL policy is trained in simulation with this reward and deployed on a real humanoid robot.
  • Figure 2: Overview of CLF-guided reinforcement learning framework. A desired velocity $v^d$ is passed to a reference generator (e.g., H-LIP or HZD) to produce targets $y^d_\alpha, \dot{y}^d_\alpha$. These, along with the robot state and privileged variables $o_t^\text{priv}$, are used to compute a CLF-based reward. The actor-critic policy is trained with this reward and outputs joint targets $q_\text{target}$ for the robot.
  • Figure 3: Tracking performance with torso mass randomly displaced within a box of size $\pm [0.05\text{ (x)}, 0.05\text{ (y)}, 0.01\text{ (z)}]$m around the nominal location. Fifty displacements are uniformly sampled, and the resulting mean and standard deviation of performance for each policy are plotted.
  • Figure 4: Robustness evaluation in simulation: HZD-CLF, LIP-CLF and the baseline policy are tested with an extra 8kg on the torso. A 2-second ramp commanded each controller up to the maximum training velocity. The steady-state mean velocities are indicated by dashed lines.
  • Figure 5: Snapshots of the three policies throughout a stride on the Unitree G1 robot. These images depict the walking motion in steady state walking with a commanded velocity of $v_x^d = 0.75$ m/s.
  • ...and 2 more figures
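
The evaluation protocols described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the ramp is assumed to be linear, and the use of the sample (rather than population) standard deviation is an assumption, since the captions do not specify either.

```python
import random
import statistics


def sample_displacements(n=50, half_extents=(0.05, 0.05, 0.01), seed=0):
    """Uniformly sample n torso-mass offsets (in meters) inside a +/- box."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(-h, h) for h in half_extents) for _ in range(n)]


def ramp_velocity(t, v_max, ramp_time=2.0):
    """Commanded velocity: linear ramp from 0 to v_max, then hold (assumed shape)."""
    return v_max * min(max(t / ramp_time, 0.0), 1.0)


def summarize(errors):
    """Mean and sample standard deviation of a per-trial performance metric."""
    return statistics.mean(errors), statistics.stdev(errors)


if __name__ == "__main__":
    offsets = sample_displacements()
    print(len(offsets), offsets[0])
    print(ramp_velocity(1.0, v_max=1.0))  # halfway through the 2-second ramp
```

In an actual study, each sampled offset would be applied to the simulated torso, the policy rolled out, and the resulting per-trial tracking errors passed to `summarize`.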

Theorems & Definitions (2)

  • Proposition 1: CLF Existence
  • proof
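
To illustrate the role of $Q \succ 0$ and the CARE in Proposition 1, the sketch below works through the simplest case by hand: a single output whose error dynamics are assumed to be a double integrator, $\dot\eta = F\eta + Gv$ with $F = \begin{bmatrix}0 & 1\\ 0 & 0\end{bmatrix}$ and $G = \begin{bmatrix}0\\ 1\end{bmatrix}$, unit input weight, and $Q = \mathrm{diag}(q_1, q_2)$. These choices and all function names are assumptions for illustration and are not taken from the paper. The CARE $F^\top P + PF - PGG^\top P + Q = 0$ then has a closed-form solution, and $V = \eta^\top P \eta$ decreases at rate $\lambda_{\min}(Q)/\lambda_{\max}(P)$ under the feedback $v = -G^\top P\eta$.

```python
import math


def care_double_integrator(q1=1.0, q2=1.0):
    """Closed-form CARE solution for F=[[0,1],[0,0]], G=[0,1]^T, R=1, Q=diag(q1,q2)."""
    p2 = math.sqrt(q1)
    p3 = math.sqrt(2.0 * p2 + q2)
    p1 = p2 * p3
    return [[p1, p2], [p2, p3]]


def care_residual(P, q1=1.0, q2=1.0):
    """Entries of F^T P + P F - P G G^T P + Q; all zero if P solves the CARE."""
    (p1, p2), (_, p3) = P
    off_diag = p1 - p2 * p3
    return [[q1 - p2 * p2, off_diag],
            [off_diag, 2.0 * p2 + q2 - p3 * p3]]


def decay_rate(P, q_min=1.0):
    """lambda = lambda_min(Q) / lambda_max(P) for a symmetric 2x2 matrix P."""
    (p1, p2), (_, p3) = P
    lam_max = 0.5 * (p1 + p3) + math.sqrt(0.25 * (p1 - p3) ** 2 + p2 * p2)
    return q_min / lam_max


def v_and_vdot(P, eta):
    """V = eta^T P eta and its time derivative under the feedback v = -G^T P eta."""
    e, de = eta
    (p1, p2), (_, p3) = P
    v = p1 * e * e + 2.0 * p2 * e * de + p3 * de * de
    u = -(p2 * e + p3 * de)
    # eta_dot = F eta + G u = [de, u], and V_dot = 2 eta^T P eta_dot.
    vdot = 2.0 * ((p1 * e + p2 * de) * de + (p2 * e + p3 * de) * u)
    return v, vdot


if __name__ == "__main__":
    P = care_double_integrator()
    lam = decay_rate(P)
    v, vdot = v_and_vdot(P, (1.0, 0.0))
    print(P, lam, vdot + lam * v <= 0.0)
```

With $Q = I$ the solution is $P = \begin{bmatrix}\sqrt{3} & 1\\ 1 & \sqrt{3}\end{bmatrix}$, which is positive definite, so $V$ is a valid Lyapunov candidate for this toy system. The full-order result in the proposition additionally relies on the invertible Jacobian $\frac{\partial y}{\partial q}$, which this sketch does not model.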