TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion

Amr Mousa, Neil Karavis, Michele Caprio, Wei Pan, Richard Allmendinger

TL;DR

This work tackles generalization gaps in quadrupedal RL caused by misaligned privileged and proprioceptive representations and covariate shift. It proposes TAR, a framework that uses a privileged teacher to shape representations via a contrastive (triplet) objective, while the student learns through proprioceptive inputs and PPO, enabling efficient training and robust OOD generalization. A deployable fine-tuning path without privileged data allows real-world continual adaptation, validated by extensive simulation results and zero-shot hardware experiments on a Unitree Go2. TAR achieves faster training, better generalization than strong baselines, and practical deployability for real-world autonomous locomotion.

Abstract

Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and the lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged "Teacher". In our experiments, TAR reaches peak performance 2x faster in training than state-of-the-art baselines, and in OOD scenarios it generalizes 40% better on average than existing methods. Moreover, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at https://amrmousa.com/TARLoco/.
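
To make the contrastive alignment idea concrete, the following is a minimal sketch of a generic triplet margin loss, in which the student's predicted latent acts as the anchor, the teacher's encoding of the same context is the positive, and the teacher's encoding of a different context is the negative. The Euclidean distance, the margin value of 1.0, and all function names are assumptions made for illustration; the paper's actual loss, distance measure, and hyperparameters may differ.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    anchor   : student's predicted next-state latent (hypothetical input)
    positive : teacher encoding of the same context
    negative : teacher encoding of a different context
    margin   : assumed value, not taken from the paper
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def batch_triplet_loss(anchors, positives, negatives, margin=1.0):
    """Mean triplet loss over a batch of (anchor, positive, negative) triples."""
    losses = [
        triplet_loss(a, p, n, margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so well-aligned triples stop contributing to the update.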

Paper Structure

This paper contains 26 sections, 4 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 0: Generated terrains for training and testing, adapted from lee2020overchallenge. We extend this setup by introducing challenging rail crossings with steep 25 cm steps, encouraging the robot to develop more robust locomotion strategies.
  • Figure 1: The training framework includes a teacher encoder $f_T$ that processes privileged states $S$ to generate structured latent representations $Z^T$. The student encoder $f_S$ extracts proprioceptive features $Z^S$ from observation $O_{t}$ and hidden states $h_{t-1}$. Our triplet loss pulls the student’s next-state prediction $\tilde{Z}_{t+1}^{+}$ close to the teacher’s encoding $Z_{t+1}$ and away from the teacher's encoding $Z_{t+1}^-$ of other contexts sampled from the buffer. The policy gradient loss updates the actor and the critic, while the latter is also updated by the triplet loss. The velocity estimator's output is regressed against the ground-truth velocity, and the estimator is frozen after training to ensure future deployment adaptability.
  • Figure 2: During adaptation or privileged-free learning, the teacher encoder $f_T$ is removed, and student encoder $f_S$ constructs positive and negative sample pairs from the current agent’s proprioceptive observations $O_{t+1}$ and those of another agent $O_{t+1}^{j \neq i}$, along with their respective hidden states $h_{t+1}$ and $h_{t+1}^{j \neq i}$. This structured sampling enforces temporal consistency in the latent space, ensuring the student encoder learns meaningful representations without direct supervision. The absence of privileged teacher supervision makes the architecture inherently off-policy compatible and facilitates robust fine-tuning in dynamic and non-stationary environments.
  • Figure 3: Training results of baseline algorithms and our model variants across three seeds. [Left]: Training reward, [Middle]: Terrain level, and [Right]: Our weighted performance metric, computed as: $M_{\text{train}} = 0.25 \times \text{Normalized Terrain Level} + 0.6 \times \text{Normalized Mean Reward} + 0.15 \times \text{Normalized Episode Length}$.
  • Figure 4: Evaluation results of all models across and settings.
  • ...and 1 more figure
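
The weighted performance metric given in the Figure 3 caption can be sketched as a small function. The weights 0.25, 0.6, and 0.15 come from the caption. The function name, the argument names, and the assumption that each input has already been normalized to the range [0, 1] are illustrative only, since the caption does not state how the normalization is done.

```python
def weighted_training_metric(norm_terrain_level, norm_mean_reward, norm_episode_length):
    """Compute M_train from the Figure 3 caption.

    All three inputs are assumed to be pre-normalized to [0, 1];
    the normalization scheme itself is not specified in the caption.
    """
    inputs = {
        "norm_terrain_level": norm_terrain_level,
        "norm_mean_reward": norm_mean_reward,
        "norm_episode_length": norm_episode_length,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # Weights as stated in the caption; they sum to 1.
    return (
        0.25 * norm_terrain_level
        + 0.6 * norm_mean_reward
        + 0.15 * norm_episode_length
    )


if __name__ == "__main__":
    # Hypothetical values, purely to show the call shape.
    print(weighted_training_metric(0.5, 0.8, 1.0))
```

Because the weights sum to 1, the metric stays in [0, 1] under this normalization assumption, and the mean reward dominates the score with 60% of the weight.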