HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

Puyue Wang; Jiawei Hu; Yan Gao; Junyan Wang; Yu Zhang; Gillian Dobbie; Tao Gu; Wafa Johal; Ting Dang; Hong Jia

TL;DR

HoRD addresses the robustness of humanoid torque control under domain shift by training a robust expert policy with history-conditioned dynamics inference and online adaptation, then distilling it into a sparse-input transformer student. It introduces SSJR (Standardized Sparse-Joint Representation) to unify sparse joint commands and HCDR to infer latent dynamics from interaction history, enabling zero-shot transfer across simulators and environments. Empirical results on AMASS-derived motions show that HoRD outperforms baselines in unseen physics engines (Genesis) and under terrain perturbations, with high success rates and substantially lower pose errors. The release of a large-scale trajectory dataset and evaluation scripts supports reproducible cross-domain benchmarking toward deployment-ready, robust humanoid control.

Abstract

Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state--action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang-0517.github.io/hord/.
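To make the phrase "sparse root-relative 3D joint keypoint trajectories" concrete, the following is a minimal sketch of one way such an input could be computed. It is an illustration under stated assumptions, not the authors' SSJR implementation: the function name `root_relative_keypoints`, the `root_index` and `keep` parameters, and the choice to remove only the root translation (leaving root orientation untouched) are all hypothetical.

```python
import numpy as np

def root_relative_keypoints(joints_world, root_index=0, keep=None):
    """Express joint positions relative to the root joint at each frame.

    joints_world: array of shape (T, J, 3) with world-frame joint positions.
    root_index:   index of the root joint (assumed here, e.g. the pelvis).
    keep:         optional list of joint indices forming the sparse subset.
    Returns an array of shape (T, K, 3), where K = len(keep) or J.
    """
    joints_world = np.asarray(joints_world, dtype=np.float64)
    # Keep a singleton joint axis so the subtraction broadcasts over all joints.
    root = joints_world[:, root_index:root_index + 1, :]
    relative = joints_world - root
    if keep is not None:
        # Select the sparse subset of joints (illustrative choice of indices).
        relative = relative[:, list(keep), :]
    return relative

# Two frames, three joints; joint 0 plays the role of the root.
joints = np.array([
    [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0], [1.0, 3.0, 1.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
])
sparse = root_relative_keypoints(joints, root_index=0, keep=[1, 2])
```

Subtracting the root position makes the command independent of where the robot stands in the world, which is one plausible reason a root-relative interface helps transfer across environments. Whether the actual method also aligns the keypoints to the root heading is not stated in the abstract.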

Paper Structure (61 sections, 15 equations, 8 figures, 7 tables)

Introduction
Related Work
Physics-guided policy learning for humanoid control.
Cross-domain generalization and dynamics adaptation.
Method
Problem Definition
Latent Environment Dynamics.
Observations and Anticipatory Input.
Policy Objective.
Framework Overview
Stage I: Full-Observation Expert Training.
Stage II: Sparse-Observation Student.
SSJR: Standardized Sparse-Joint Representation
Motion Abstraction.
Policy Integration.
...and 46 more sections

Figures (8)

Figure 1: Framework overview. Two-stage teacher--student learning pipeline for robust humanoid control under partial observability. Stage I: an expert policy $\pi^\star$ is trained with PPO in simulation using privileged full-state observations $\mathbf{s}_t^{\text{full}}$, dense future motion intent $\mathbf{Y}_t^{\text{full}}$, and episode-level domain randomization $\boldsymbol{\psi}^{(e)}$. A shared HCDR module encodes the interaction history $\mathcal{H}_t$ into a temporal memory embedding $\mathbf{m}_t$ for online dynamics inference and adaptive modulation. Stage II: a deployable student policy $\pi$ receives only sparse proprioception $\mathbf{s}_t^{\text{sparse}}$, environment context $\mathbf{g}_t$, and standardized motion commands $\mathbf{Y}_t^{\text{sparse}}$ via SSJR, and is trained by distillation to match the expert's actions. SSJR maps a global planner command into a platform-agnostic sparse-joint command interface, enabling cross-platform transfer while HCDR provides in-context adaptation to latent dynamics during deployment.
Figure 2: Results of HoRD on six representative motions; red markers indicate ground-truth skeleton joints. Qualitative comparisons with baselines are shown in Appendix Fig. 5. With DR and HCDR, the robot is able to mimic a range of different human actions.
Figure 3: Pose estimation errors comparing HoRD and ExBody2 across six representative motions.
Figure 4: Online adaptation experiment. A lateral push is applied mid-execution (second frame), perturbing the humanoid’s motion. HoRD rapidly re-stabilizes and resumes the intended trajectory (red markers).
Figure 5: Qualitative comparisons across 6 different motions.
...and 3 more figures
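
The Figure 1 caption names three ingredients that a short sketch can illustrate: episode-level domain randomization $\boldsymbol{\psi}^{(e)}$, an interaction history $\mathcal{H}_t$ of recent state--action pairs, and a distillation objective that makes the student match the expert's actions. The code below is a hedged illustration only. The class `HistoryBuffer`, the functions `sample_episode_dynamics` and `distillation_loss`, the randomized parameters and their ranges, the zero-padding scheme, and the squared-error loss are all assumptions, not details taken from the paper.

```python
from collections import deque
import numpy as np

class HistoryBuffer:
    """Fixed-length window of recent (state, action) pairs (hypothetical helper)."""

    def __init__(self, horizon, state_dim, action_dim):
        self.horizon = horizon
        self.step_dim = state_dim + action_dim
        self.steps = deque(maxlen=horizon)  # oldest entries drop out automatically

    def reset(self):
        # Called at the start of each episode, when the dynamics are resampled.
        self.steps.clear()

    def push(self, state, action):
        self.steps.append(np.concatenate([state, action]).astype(np.float64))

    def as_array(self):
        # Zero-pad at the front so the most recent step is always the last row.
        out = np.zeros((self.horizon, self.step_dim))
        if self.steps:
            out[-len(self.steps):] = np.stack(list(self.steps))
        return out

def sample_episode_dynamics(rng):
    """Draw one set of dynamics parameters per episode (illustrative ranges)."""
    return {
        "friction_scale": rng.uniform(0.5, 1.5),
        "mass_scale": rng.uniform(0.8, 1.2),
    }

def distillation_loss(student_actions, expert_actions):
    """Mean squared action error between student and expert (assumed loss form)."""
    diff = np.asarray(student_actions, dtype=np.float64) - np.asarray(
        expert_actions, dtype=np.float64
    )
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

rng = np.random.default_rng(0)
psi = sample_episode_dynamics(rng)  # held fixed for the whole episode
history = HistoryBuffer(horizon=3, state_dim=2, action_dim=1)
history.push(np.array([1.0, 2.0]), np.array([3.0]))
window = history.as_array()  # shape (3, 3); earlier rows are zero padding
loss = distillation_loss([[1.0, 0.0]], [[0.0, 0.0]])
```

In an online distillation loop of this kind, the states would typically come from rollouts of the student itself, with the expert queried for target actions on those states. That detail is a common design choice for online distillation in general and is assumed here rather than confirmed by the caption.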
