TransCurriculum: Multi-Dimensional Curriculum Learning for Fast & Stable Locomotion

Prakhar Mishra, Amir Hossain Raj, Xuesu Xiao, Dinesh Manocha

Abstract

High-speed legged locomotion suffers from instability and transfer losses at higher command velocities during deployment. One reason is that most curricula vary difficulty along a single axis, for example by increasing the range of command velocities, terrain difficulty, or domain parameters (e.g., friction or payload mass), using either a fixed update rule or instantaneous rewards while ignoring how the robot's training history has evolved. We propose TransCurriculum, a transformer-based multi-dimensional curriculum learning approach for agile quadrupedal locomotion. TransCurriculum adapts along three axes: velocity command targets, terrain difficulty, and domain randomization parameters (friction and payload mass). Rather than feeding task reward history directly into the low-level control policy, our formulation exploits it at the curriculum level. A transformer-based teacher retrieves the sequence of rewards and uses it to predict future rewards, success rate, and learning progress, which guide the expansion of this multi-dimensional curriculum towards high-performing task bins. We validate our approach on the Unitree Go1 robot in simulation (Isaac Gym) and deploy it zero-shot on Go1 hardware. Our TransCurriculum policy achieves a maximum velocity of 6.3 m/s in simulation and outperforms prior curriculum baselines. We tested the TransCurriculum-trained policy on several terrains (carpet, slopes, tiles, concrete), achieving a forward velocity of 4.1 m/s on carpet, which surpasses the fastest curriculum methods by 18.8% and is the highest zero-shot velocity among all tested methods. Our multi-dimensional curriculum also reduces the transfer loss to 18%, from 27% for a command-only curriculum, demonstrating the benefit of joint training over the velocity, terrain, and domain randomization dimensions while maintaining a task success rate of 80-90% on rigid indoor and outdoor surfaces.
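
As a rough illustration of the multi-dimensional task space described in the abstract, the sketch below builds task bins as the Cartesian product of discretized axes and samples one bin in proportion to a per-bin score. The axis values, the names `AXES`, `make_bins`, and `sample_task`, and the uniform placeholder scores are all assumptions made for this example; in the paper the scores would come from the transformer teacher's predictions, which are not reproduced here.

```python
import itertools
import random

# Hypothetical discretization of each curriculum axis (values are illustrative only).
AXES = {
    "v_x_max": [1.0, 2.0, 3.0, 4.0],   # forward command limit, m/s
    "terrain_level": [0, 1, 2],        # terrain difficulty index
    "friction": [0.4, 0.8, 1.2],       # friction coefficient
    "payload_kg": [0.0, 1.0, 2.0],     # added payload mass, kg
}


def make_bins(axes):
    """Return one dict per task bin: the Cartesian product of all axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]


def sample_task(bins, scores, rng):
    """Sample one bin with probability proportional to its non-negative score."""
    return rng.choices(bins, weights=scores, k=1)[0]


bins = make_bins(AXES)
scores = [1.0] * len(bins)  # uniform placeholder standing in for teacher predictions
task = sample_task(bins, scores, random.Random(0))
```

The sampled `task` dict plays the role of a task context applied to the simulator for the next rollout; raising a bin's score makes that bin more likely to be drawn.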

Paper Structure (25 sections, 16 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Zero-shot hardware evaluation on diverse terrain (Unitree Go1, TransCurriculum). We deploy TransCurriculum on Go1 and report speed, success rate, and lateral deviation over short runs. Row 1: Pebbles (2-3 m): $2.1 \pm 0.3$ m/s, 60% success, and $0.3 \pm 0.1$ m lateral deviation. Row 2: Wooden slopes (approximately $20^\circ$, 3-5 m): $3.1 \pm 0.4$ m/s, 80% success, and $0.5 \pm 0.3$ m lateral deviation. Row 3: Rocks (2-3 m): $1.5 \pm 0.4$ m/s, 50% success, and $0.34 \pm 0.1$ m lateral deviation. The policy remains functional across terrains without finetuning, with some performance degradation as terrain difficulty increases, supporting the benefits of history-aware, multi-dimensional training.
  • Figure 2: TransCurriculum pipeline. The low-level policy $\pi_{\theta}$ is trained with PPO using rollouts from the Isaac Gym environment. The TransCurriculum module maintains bins over the multi-dimensional task space of commands, terrain difficulty, and domain parameters, and uses rollouts (observations and rewards) to update the bin distribution. The transformer curriculum retrieves the context-outcome history to predict the rewards, success, and progress of these curriculum bins. The sampled task context $z$ is applied to the simulator for the next PPO rollout.
  • Figure 3: Curriculum range expansion over command space. TransCurriculum starts from a narrow command range of $[-1.0, 1.0]$ and gradually expands it by $[-0.5, 0.5]$ as the policy achieves stable velocity tracking performance. We plot the maximum sampled command values during training, $v_x^{\text{max}}$, $v_y^{\text{max}}$, and $\omega_z^{\text{max}}$, as TransCurriculum expands the command range. The curriculum steadily expands along forward speed $v_x^{\text{max}}$, while the lateral and yaw-rate commands saturate at lower thresholds, consistent with our reward shaping for forward locomotion.
  • Figure 4: Effect of curriculum bin selection criteria. We train TransCurriculum with 250, 1000, 4000, and 6000 bins under identical training conditions and compare the learning curves (1500 PPO updates). Coarse binning (250/1000 bins) does not reach high speed within the given training time, plateauing around $1.5$-$2.0$ m/s. The 4000-bin configuration achieves approximately 6 m/s and reaches 90% of the target speed within 8.6M environment steps, providing the best tradeoff between exploration and stable curriculum updates. Increasing the resolution to 6000 bins raises the velocity to $6$-$6.3$ m/s and reaches 90% of the target speed within 9.5M steps. We therefore use 4000 bins in our experiments unless noted otherwise.
  • Figure 5: Disturbance recovery cases on diverse terrains (zero-shot Go1). In the cases shown, the policy experiences a bump or trip, or deviates laterally, and re-stabilizes to its normal gait within a few seconds. These examples show that the TransCurriculum-trained policy remains functional under disturbances and terrain irregularities without any additional finetuning (see accompanying video).
  • ...and 1 more figures
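
The range expansion described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "expands by $[-0.5, 0.5]$" means each bound moves outward by 0.5, and the names `CommandRange` and `update_range`, the tracking-score threshold of 0.8, and the hard limit of 7.0 are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class CommandRange:
    """Symmetric command range, starting at the narrow range [-1.0, 1.0]."""
    low: float = -1.0
    high: float = 1.0

    def expand(self, step: float = 0.5, limit: float = 7.0) -> None:
        """Move each bound outward by `step`, clipped to +/- `limit` (assumed cap)."""
        self.low = max(self.low - step, -limit)
        self.high = min(self.high + step, limit)


def update_range(cmd_range: CommandRange, tracking_score: float,
                 threshold: float = 0.8) -> CommandRange:
    """Expand the range only when tracking is judged stable.

    `tracking_score` and `threshold` are placeholders for whatever stability
    criterion the curriculum actually uses.
    """
    if tracking_score >= threshold:
        cmd_range.expand()
    return cmd_range


cmd_range = CommandRange()
update_range(cmd_range, tracking_score=0.9)  # stable tracking: range widens
update_range(cmd_range, tracking_score=0.5)  # unstable tracking: range unchanged
```

Under these assumptions, a separate `CommandRange` per command dimension ($v_x$, $v_y$, $\omega_z$) would let the forward-speed range keep growing while the lateral and yaw-rate ranges stop expanding once their tracking scores fall below the threshold.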