
Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control

Zelin Tao, Zeran Su, Peiran Liu, Jingkai Sun, Wenqiang Que, Jiahao Ma, Jialin Yu, Jiahang Cao, Pihai Sun, Hao Liang, Gang Han, Wen Zhao, Zhiyuan Xu, Jian Tang, Qiang Zhang, Yijie Guo

Abstract

Achieving general-purpose humanoid control requires a delicate balance between the precise execution of commanded motions and the flexible, anthropomorphic adaptability needed to recover from unpredictable environmental perturbations. Current general controllers predominantly formulate motion control as a rigid reference-tracking problem. While effective in nominal conditions, these trackers often exhibit brittle, non-anthropomorphic failure modes under severe disturbances, lacking the generative adaptability inherent to human motor control. To overcome this limitation, we propose Heracles, a novel state-conditioned diffusion middleware that bridges precise motion tracking and generative synthesis. Rather than relying on rigid tracking paradigms or complex explicit mode-switching, Heracles operates as an intermediary layer between high-level reference motions and low-level physics trackers. By conditioning on the robot's real-time state, the diffusion model implicitly adapts its behavior: it approximates an identity map when the state closely aligns with the reference, preserving zero-shot tracking fidelity. Conversely, when encountering significant state deviations, it seamlessly transitions into a generative synthesizer to produce natural, anthropomorphic recovery trajectories. Our framework demonstrates that integrating generative priors into the control loop not only significantly enhances robustness against extreme perturbations but also elevates humanoid control from a rigid tracking paradigm to an open-ended, generative general-purpose architecture.


Paper Structure

This paper contains 20 sections, 21 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Heracles synthesizes diverse, anthropomorphic recovery motions via state-conditioned diffusion. In contrast to recovery policies trained with termination tricks that converge to a limited set of stereotypical maneuvers, Heracles leverages its generative middleware to produce a rich repertoire of agile, human-like recovery behaviors, enabling more general and robust responses across a wide range of extreme perturbations.
  • Figure 2: Overview of the Heracles framework. (a) A flow matching model $\hat{D}_\theta$ learns to synthesize feasible keyframe trajectories conditioned on the current state. (b) Reference motions are quantized into discrete tokens via FSQ, shared by reconstruction and action prediction heads. (c) At inference, the middleware generates trajectories through closed-loop replanning for the motion tracker to execute.
  • Figure 3: Qualitative sim-to-sim comparison on an out-of-distribution martial arts sequence. Each row shows a different method tracking the same reference motion on a Unitree G1 humanoid alongside a reference ghost in MuJoCo. MLP, Transformer, and SONIC collapse early; VQ-VAE barely tracks the motion. Among our ablations, iFSQBM and iFSQ survive without falling, while iFSQ+H falls but later recovers. Heracles (Ours) tracks the full sequence most accurately, demonstrating the strongest robustness to OOD motions.
  • Figure 4: Real-world motion tracking across diverse and dynamic behaviors. Real-world experiments demonstrate that our model generalizes to a broad spectrum of motions, from everyday locomotion (walk, run) to highly dynamic skills (kick, 360° kick) and human-object interaction.
  • Figure 5: Qualitative sim-to-sim comparison on an OOD lie-to-stand sequence. Same setup as Figure 3. MLP, Transformer, SONIC, and iFSQBM fail to stand up; VQ-VAE, iFSQ+H, and iFSQ partially track the motion. Heracles (Ours) completes the full lie-to-stand transition and most accurately tracks the root position.
  • ...and 2 more figures