Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions

Zhenyu Jiang, Yuqi Xie, Jinhan Li, Ye Yuan, Yifeng Zhu, Yuke Zhu

TL;DR

Harmon generates whole-body humanoid motions from language descriptions. It draws on large-scale human motion priors through a diffusion model that creates a human motion from the text, and then retargets that motion to the humanoid with inverse kinematics (IK). A VLM-based editing stage adds head and finger motions and iteratively refines arm movements so that the result aligns with the language description. For execution on real robots, locomotion and upper-body control are separated via a ZMP-based approach and joint-position commands. In human studies, Harmon outperforms baselines on language-motion alignment, and real-world experiments show diverse, expressive motions. Remaining limitations in primitive control and balance motivate future work on RL-based whole-body control and learned primitives.

Abstract

Humanoid robots, with their human-like embodiment, have the potential to integrate seamlessly into human environments. Critical to their coexistence and cooperation with humans is the ability to understand natural language communications and exhibit human-like behaviors. This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions. We leverage human motion priors from extensive human motion datasets to initialize humanoid motions and employ the commonsense reasoning capabilities of Vision Language Models (VLMs) to edit and refine these motions. Our approach demonstrates the capability to produce natural, expressive, and text-aligned humanoid motions, validated through both simulated and real-world experiments. More videos can be found at https://ut-austin-rpl.github.io/Harmon/.

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: We generate diverse whole-body humanoid motions from free-form language descriptions and execute these motions on the real humanoid robot.
  • Figure 2: Overview of Harmon. Given the language description of a motion, we first generate corresponding human motion and retarget it to the humanoid using inverse kinematics. Next, we utilize a VLM to refine the humanoid motion. This process involves extracting finger and head motion descriptions from the initial language description and generating the corresponding motions using the VLM. Given the rendered humanoid motion, the VLM iteratively evaluates and adjusts the motion to ensure alignment with the language description. Finally, Harmon generates whole-body humanoid motion that accurately aligns with the language description.
  • Figure 3: VLM-based motion editing. Top left: GPT-4 generates finger motions at keyframes based on the rendered humanoid motions and the finger motion description. Top right: GPT-4 identifies keyframes and generates head motions from the head motion description. Bottom: GPT-4 iteratively adjusts arm motion by evaluating and refining the rendered humanoid frames based on the motion description.
  • Figure 4: Quantitative results of human study. A higher normalized score indicates a better alignment between the humanoid motion and the language description.
  • Figure 5: Qualitative results of Harmon. We highlight the generated head and finger motions with red circles and the motion adjustment with red arrows.
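
The iterative arm-motion adjustment described in the captions of Figures 2 and 3 can be pictured as an evaluate-then-edit loop. The following is a minimal sketch of that control flow only, not the authors' implementation: `Motion`, `refine_motion`, `toy_critic`, and the `max_rounds` parameter are all hypothetical names introduced here, and the critic is a hand-written stand-in for the VLM, which in the paper inspects rendered frames rather than raw numbers.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Motion:
    """Toy stand-in for a humanoid motion: one arm height per keyframe (hypothetical)."""
    arm_heights: List[float]
    edits: List[str] = field(default_factory=list)


# A critic looks at the motion and the text. It returns None when it judges
# the motion aligned, or (keyframe_index, delta) describing one adjustment.
Critic = Callable[[Motion, str], Optional[Tuple[int, float]]]


def refine_motion(motion: Motion, description: str, critic: Critic,
                  max_rounds: int = 3) -> Motion:
    """Repeatedly ask the critic for one adjustment and apply it in place."""
    for round_idx in range(max_rounds):
        feedback = critic(motion, description)
        if feedback is None:  # critic is satisfied, stop early
            break
        frame, delta = feedback
        motion.arm_heights[frame] += delta
        motion.edits.append(f"round {round_idx}: keyframe {frame} adjusted by {delta}")
    return motion


def toy_critic(motion: Motion, description: str) -> Optional[Tuple[int, float]]:
    """Assumed stand-in for the VLM: if the text says 'raise', request that
    the first keyframe whose arm height is below 1.0 be lifted to 1.0."""
    if "raise" not in description:
        return None
    for i, height in enumerate(motion.arm_heights):
        if height < 1.0:
            return (i, 1.0 - height)
    return None


if __name__ == "__main__":
    result = refine_motion(Motion([0.0, 1.0, 0.5]), "raise both arms",
                           toy_critic, max_rounds=5)
    print(result.arm_heights)
    print(result.edits)
```

Bounding the loop with `max_rounds` is a design assumption of this sketch: it keeps the number of critic calls predictable even if the critic never reports the motion as aligned.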