Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions
Zhenyu Jiang, Yuqi Xie, Jinhan Li, Ye Yuan, Yifeng Zhu, Yuke Zhu
TL;DR
Harmon generates whole-body humanoid motion from language descriptions. A diffusion model that captures large-scale human motion priors first generates a human motion from the text, which is then retargeted to the humanoid with inverse kinematics (IK). A vision-language model (VLM) editing stage then adds head and finger motions and iteratively refines arm movements so that the result matches the description. For execution on real robots, locomotion and upper-body control are separated using a zero-moment-point (ZMP) based approach and joint-position commands. In human studies, Harmon outperforms baselines on language-motion alignment, and real-world experiments show diverse, expressive motions. Remaining limitations in primitive control and balance motivate future work on RL-based whole-body control and learned primitives.
Abstract
Humanoid robots, with their human-like embodiment, have the potential to integrate seamlessly into human environments. Critical to their coexistence and cooperation with humans is the ability to understand natural language communications and exhibit human-like behaviors. This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions. We leverage human motion priors from extensive human motion datasets to initialize humanoid motions and employ the commonsense reasoning capabilities of Vision Language Models (VLMs) to edit and refine these motions. Our approach demonstrates the capability to produce natural, expressive, and text-aligned humanoid motions, validated through both simulated and real-world experiments. More videos can be found at https://ut-austin-rpl.github.io/Harmon/.
