HACL: History-Aware Curriculum Learning for Fast Locomotion

Prakhar Mishra, Amir Hossain Raj, Xuesu Xiao, Dinesh Manocha

TL;DR

The paper tackles fast, stable legged locomotion under uncertain, non-Markovian dynamics by introducing History-Aware Curriculum Learning (HACL). HACL uses a recurrent model to capture reward history through the hidden state $h_{t-1}$ and predicts per-bin rewards $\hat{\mu}(b)$ for a discretized command space with BinID $x_t$, updating bin weights via $w_t(b)$ and guiding curriculum selection with a history-aware utility $u(b)$. A two-level setup decouples the meta-scheduler from the low-level PPO controller, training a predictor with loss $L(\psi) = \frac{1}{T} \sum_{t=1}^T (r_t - \hat{\mu}(b_t))^2$, which accelerates convergence to high-speed, stable gait patterns. The approach yields peak simulation speeds of $6.7$ m/s at a command of $7$ m/s and real-world Go1 speeds of $4.1 \pm 0.2$ m/s on diverse terrains, illustrating strong sim-to-real transfer and generalization across morphologies (Mini Cheetah, Go1, Go2). Overall, HACL improves agility, stability, and energy efficiency while reducing dependence on manual curriculum tuning, offering a practically impactful route to robust high-speed locomotion in unstructured environments.
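
As a rough illustration of the predictor loss $L(\psi)$ quoted above, the following sketch computes the mean squared error between observed rewards $r_t$ and the predicted reward of the bin visited at each step, $\hat{\mu}(b_t)$. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `predictor_loss`, the use of a plain list for $\hat{\mu}$ (in the paper the predictions come from a recurrent model), and all numeric values are hypothetical.

```python
def predictor_loss(rewards, bin_ids, mu_hat):
    """Mean squared error between r_t and mu_hat[b_t], averaged over T steps.

    rewards : observed rewards r_1..r_T
    bin_ids : index b_t of the command bin sampled at each step
    mu_hat  : predicted reward per bin (assumed here to be a plain list;
              in HACL these predictions are produced by a recurrent model)
    """
    if len(rewards) != len(bin_ids):
        raise ValueError("rewards and bin_ids must have the same length")
    T = len(rewards)
    squared_errors = [(r - mu_hat[b]) ** 2 for r, b in zip(rewards, bin_ids)]
    return sum(squared_errors) / T


# Toy example with 3 bins and 4 time steps (all values are made up).
mu_hat = [0.5, 1.0, 2.0]
bin_ids = [0, 2, 1, 2]
rewards = [0.5, 1.0, 1.0, 3.0]
loss = predictor_loss(rewards, bin_ids, mu_hat)
print(loss)  # 0.5
```

In the toy example the per-step errors are 0, -1, 0 and 1, so the squared errors average to 0.5. How the bin weights $w_t(b)$ and the utility $u(b)$ are derived from these predictions is not specified in this summary, so the sketch stops at the loss.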

Abstract

We address the problem of agile and rapid locomotion, a key characteristic of quadrupedal and bipedal robots. We present a new algorithm that maintains stability and generates high-speed trajectories by considering the temporal aspect of locomotion. Our formulation takes into account past information based on a novel History-Aware Curriculum Learning (HACL) algorithm. We model the history of joint velocity commands with respect to the observed linear and angular rewards using a recurrent neural net (RNN). The hidden state helps the curriculum learn the relationship between the forward linear velocity and angular velocity commands and the rewards over a given time-step. We validate our approach on the MIT Mini Cheetah, Unitree Go1, and Go2 robots in a simulated environment and on a Unitree Go1 robot in real-world scenarios. In practice, HACL achieves a peak forward velocity of 6.7 m/s for a given command velocity of 7 m/s and outperforms prior locomotion algorithms by nearly 20%.

Paper Structure

This paper contains 22 sections, 15 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Testing on diverse terrains (Unitree Go1, deployed HACL). Row 1: On pebbles (2-3 m), Go1 maintains $2.1 \pm 0.3$ m/s with a success rate of 60% and a lateral shift of $0.3 \pm 0.1$ m. Row 2: On wooden slopes (approximately $20^\circ$, 3-5 m), the robot achieves $3.1 \pm 0.4$ m/s with a success rate of 80% and a lateral shift of $0.5 \pm 0.3$ m. Row 3: Over a 2-3 m run, the robot achieves $1.5 \pm 0.4$ m/s; the success rate hovers around 50% and performance is very similar to that on pebbles. Row 4: For angular rotation, it achieves $3.7 \pm 0.2$ rad/s and a success rate of 100% for the given angular command velocity. HACL maintains high speed and stability across varied tasks and terrains.
  • Figure 2: HACL overview. Our HACL module receives the observations and rewards from the IsaacGym environment. HACL learns the hidden pattern between these reward distributions and the high-level sampled task parameters. The policy model is optimized using PPO and generates the low-level task parameter commands, which the Unitree Go1 executes in the environment.
  • Figure 3: Curriculum Range Expansion: HACL leverages history to achieve better results by allowing the agent to master the lower range first before proceeding to higher velocities, avoiding the pitfalls of instability and failure during training. The curriculum begins with the range $[-1.0,1.0]$ and expands by $[-0.5, 0.5]$. As evident from the plot, the policy learns $v_x^{\text{max}}$ faster and more steadily, while $v_y^{\text{max}}$ and $\omega_z^{\text{max}}$ are less consistent and saturate at lower thresholds, owing to our reward optimization of forward locomotion.
  • Figure 4: Binning Criteria: HACL trained with 250, 1000, 4000 and 6000 bins under identical training conditions to measure efficiency and performance over 1500 iterations. Coarse binning (250 or 1000 bins) fails to learn high-speed locomotion and plateaus between 1.5 and 2 m/s. With 4000 bins, the robot achieves 6 m/s and reaches 90% of the target velocity within 8.6 million steps, whereas with 6000 bins the robot reaches 6-6.3 m/s and 90% of the target velocity within 9.5 million steps; the finer binning needs more training time to visit each bin, leading to greater reward variance.
  • Figure 5: High-speed locomotion with HACL. Row 1 (Real-world, HACL): On woolen carpet, the policy achieves $4.1 \pm 0.2$ m/s over a 3-8 m run with a task success rate of 90%. Row 2 (Sim, HACL): At a command velocity of 7 m/s, Go1 reaches $6.7 \pm 0.2$ m/s, and the joint-position time plot shows stable leg coordination (12 joints: LF, RF, LH, RH) at higher command velocity. HACL maintains stable, consistent joint positions both in the real world and in simulation (Rows 1 and 2).
  • ...and 1 more figure
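
The range expansion described in the Figure 3 caption (start at $[-1.0, 1.0]$, expand by $[-0.5, 0.5]$) can be sketched as below. This is one possible reading, assuming each expansion widens the range by 0.5 on each side; it is not the paper's code. The mastery test (`mean_reward >= threshold`), the threshold value, the optional `hard_limit`, and the function names are all hypothetical, since the summary does not state the exact expansion criterion.

```python
def expand_range(lo, hi, step=0.5, hard_limit=None):
    """Widen a command range by `step` on each side (assumed reading of
    'expands by [-0.5, 0.5]'); optionally clamp to +/- hard_limit."""
    new_lo, new_hi = lo - step, hi + step
    if hard_limit is not None:
        new_lo = max(new_lo, -hard_limit)
        new_hi = min(new_hi, hard_limit)
    return new_lo, new_hi


def maybe_expand(lo, hi, mean_reward, threshold=0.8, **kwargs):
    """Expand only once the current range looks mastered.

    The criterion and the 0.8 threshold are hypothetical placeholders.
    """
    if mean_reward >= threshold:
        return expand_range(lo, hi, **kwargs)
    return lo, hi


# Start from the initial range given in the caption.
vx_range = (-1.0, 1.0)
vx_range = maybe_expand(*vx_range, mean_reward=0.9)  # mastered: widens
vx_range = maybe_expand(*vx_range, mean_reward=0.5)  # not mastered: unchanged
print(vx_range)  # (-1.5, 1.5)
```

Gating the expansion on performance in the current range reflects the caption's point that the agent masters lower velocities before proceeding to higher ones.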