Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary
Zhirui Liu, Kaiyang Ji, Ke Yang, Jingyi Yu, Ye Shi, Jingya Wang
TL;DR
Open-ended language-conditioned control for humanoid whole-body motion remains challenging due to data scarcity and embodiment gaps. The authors introduce Humanoid-LLA, a three-part framework that unifies human and humanoid motion into a shared discrete vocabulary via a cross-embodiment VQ-VAE, distills a token-based policy from a privileged tracker, and trains a Large Language Action Model with supervised learning and physics-informed RL. A vocabulary-directed controller translates motion tokens into executable actions, enabling robust, diverse, and physically plausible humanoid behaviors driven by free-form language. In simulation and on a real Unitree G1, Humanoid-LLA outperforms prior methods on both motion quality and physical fidelity, demonstrating practical potential for open-vocabulary humanoid control with strong language grounding.
Abstract
Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.
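To make the idea of a "shared discrete space" concrete, the following is a minimal sketch of the vector-quantization step that a VQ-VAE-style vocabulary relies on: each latent vector is replaced by the index of its nearest codebook entry. It is an illustration under stated assumptions, not the authors' implementation. The codebook size, the 2-D latent dimension, the latent values, and the function names `quantize` and `decode` are all hypothetical, and the encoders that would produce the latents from human or humanoid motion are omitted.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def decode(tokens, codebook):
    """Look up the codebook vector for each token index."""
    return [codebook[t] for t in tokens]


# Hypothetical toy codebook: 4 entries in a 2-D latent space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical latents, standing in for the outputs of a human-motion
# encoder and a humanoid-motion encoder on corresponding clips.
human_latents = [[0.1, -0.05], [0.9, 0.2], [0.8, 0.95]]
humanoid_latents = [[-0.1, 0.1], [1.1, -0.1], [1.05, 1.1]]

human_tokens = quantize(human_latents, codebook)
humanoid_tokens = quantize(humanoid_latents, codebook)

print(human_tokens)     # [0, 1, 3]
print(humanoid_tokens)  # [0, 1, 3]
```

In this toy setup, corresponding human and humanoid latents fall nearest the same codebook entries, so both embodiments yield the same token sequence. That is the property a unified motion vocabulary is meant to provide: a language model can emit one token stream regardless of which embodiment the training data came from.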
