Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang

TL;DR

Open-ended language-conditioned control of humanoid whole-body motion remains challenging due to data scarcity and embodiment gaps. The authors introduce Humanoid-LLA, a three-part framework that unifies human and humanoid motion into a shared discrete vocabulary via a cross-embodiment VQ-VAE, distills a token-based policy from a privileged tracker, and trains a Large Language Action Model with supervised learning and physics-informed RL. A vocabulary-directed controller translates motion tokens into executable actions, enabling robust, diverse, and physically plausible humanoid behaviors driven by free-form language. In simulation and on a real Unitree G1, Humanoid-LLA outperforms prior methods on both motion quality and physical fidelity, demonstrating practical potential for open-vocabulary humanoid control with strong language grounding.

Abstract

Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
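
To make the idea of a "shared discrete space" concrete, here is a minimal sketch of the nearest-neighbor codebook lookup that vector-quantized models (such as the cross-embodiment VQ-VAE named in the TL;DR) typically use. It is an illustration under stated assumptions, not the authors' implementation: the two-dimensional latents, the three-entry codebook, and the function name `quantize` are all hypothetical, and the encoders that would produce the latents are omitted.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Generic vector-quantization lookup (squared Euclidean distance).
    The codebook contents and sizes used below are illustrative only.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


# Hypothetical shared codebook: one set of entries for both embodiments.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical encoder outputs for the same motion clip, one per embodiment.
human_latents = [[0.9, 0.1], [0.1, 0.8]]
humanoid_latents = [[1.1, -0.1], [-0.1, 1.2]]

human_tokens = quantize(human_latents, codebook)
humanoid_tokens = quantize(humanoid_latents, codebook)
print(human_tokens, humanoid_tokens)
```

In this toy setup the human and humanoid latents fall near the same codebook entries, so both clips map to the same token sequence. That is the property a unified vocabulary is meant to provide; how the paper trains its encoders and codebook to achieve it is not described in this summary.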

Paper Structure

This paper contains 25 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustration of Humanoid-LLA. Given a high-level and abstract command (e.g., “walk in a curving figure-eight”), Humanoid-LLA first uses natural language to decompose the task (<think>...</think>), and then generates a sequence of unified motion tokens (<motion>...</motion>). A vocabulary-directed controller executes these tokens on the robot, bridging language, a unified human–humanoid motion vocabulary, and action-level control to yield physically faithful, natural whole-body behaviors.
  • Figure 2: An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary from a large-scale dataset of paired human and humanoid motions. In stage two, given a kinematic humanoid motion goal and the vocabulary tokens retrieved for it, we distill a vocabulary-directed humanoid student controller from a teacher tracking controller. Together, the first two stages let stage three obtain humanoid feedback directly from physics simulation without decoding, which gives the LLA high physical fidelity and language generalization.
  • Figure 3: Real-world demonstration of free-form language-conditioned humanoid whole-body control. The tested prompts contain unseen terms ("soldier", "military parade march", "martial arts"). Thanks to the LLA's strong language understanding and motion reasoning capabilities, the humanoid performs reasonable motions even for such abstract instructions.
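
The output layout described in the Figure 1 caption (a `<think>...</think>` span followed by a `<motion>...</motion>` span) can be sketched as a small parser. This is a hedged illustration only: the caption does not specify how individual motion tokens are written, so the `<m_17>`-style spelling, the sample string, and the function name `parse_lla_output` are assumptions made for the example.

```python
import re

# The <think> and <motion> tags come from the Figure 1 caption.
# The per-token spelling "<m_N>" is a hypothetical placeholder format.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
MOTION_RE = re.compile(r"<motion>(.*?)</motion>", re.DOTALL)
TOKEN_RE = re.compile(r"<m_(\d+)>")


def parse_lla_output(text: str) -> tuple[str, list[int]]:
    """Split generated text into reasoning text and motion token ids."""
    motion = MOTION_RE.search(text)
    if motion is None:
        raise ValueError("no <motion>...</motion> block found")
    think = THINK_RE.search(text)
    reasoning = think.group(1).strip() if think else ""
    token_ids = [int(t) for t in TOKEN_RE.findall(motion.group(1))]
    return reasoning, token_ids


# Made-up sample output for illustration; not taken from the paper.
sample = (
    "<think>Walk forward, curve left, then curve right.</think>"
    "<motion><m_17><m_4><m_4><m_231></motion>"
)
reasoning, token_ids = parse_lla_output(sample)
print(reasoning)
print(token_ids)
```

In a full pipeline the resulting token ids would be handed to the vocabulary-directed controller; that interface is not described here, so the sketch stops at parsing.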