Semantic Co-Speech Gesture Synthesis and Real-Time Control for Humanoid Robots

Gang Zhang

TL;DR

The paper tackles the challenge of generating semantically meaningful co-speech gestures and deploying them in real time on humanoid robots. It presents an end-to-end pipeline combining General Motion Retargeting, a Residual VQ-VAE for discrete motion tokens, an autoregressive Motion-GPT conditioned on audio, and an imitation-learning MotionTracker controller for real-time execution. A key novelty is the integration of LLM-based retrieval to augment semantic gesture candidates with semantics-aware alignment, improving the naturalness and meaning of gestures. Extensive experiments validate high-fidelity motion encoding, gesture generation quality, and successful real-world deployment on the Unitree G1, indicating practical viability for expressive human-robot interaction.
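
As a rough illustration of how a residual quantizer turns a continuous motion latent into a short sequence of discrete tokens, the sketch below quantizes a vector in stages, with each stage encoding whatever the previous stage left over. This is a minimal, generic sketch of residual vector quantization, not the paper's implementation: the function names, the two toy codebooks, and the 2-D latent are assumptions made for illustration, and a real Residual VQ-VAE would learn its codebooks jointly with an encoder and decoder.

```python
def nearest_index(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def residual_quantize(vector, codebooks):
    """Quantize `vector` stage by stage; each codebook encodes the remaining residual."""
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        code = codebook[idx]
        tokens.append(idx)
        # Accumulate the chosen codeword and subtract it from what is left to encode.
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return tokens, reconstruction


# Hypothetical toy codebooks: a coarse stage followed by a finer correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]

tokens, approx = residual_quantize([1.3, 0.9], [coarse, fine])
print(tokens, approx)  # [1, 1] [1.25, 1.0]
```

The token list (one index per stage) is the kind of discrete representation an autoregressive model can be trained to predict; summing the selected codewords recovers an approximation of the original latent.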

Abstract

We present an end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real time on a humanoid robot. The system addresses the challenge of creating natural, expressive non-verbal communication for robots by integrating gesture generation with physical control. Our core contribution is a semantics-aware gesture synthesis module that derives expressive reference motions from speech input, using a generative retrieval mechanism based on large language models (LLMs) together with an autoregressive Motion-GPT model. This module is coupled with a high-fidelity imitation learning control policy, the MotionTracker, which enables the Unitree G1 humanoid robot to execute these complex motions dynamically while maintaining balance. To ensure feasibility, we employ General Motion Retargeting (GMR) to bridge the embodiment gap between human motion data and the robot platform. Through comprehensive evaluation, we demonstrate that the combined system produces semantically appropriate and rhythmically coherent gestures that are accurately tracked and executed by the physical robot. To our knowledge, this work represents a significant step toward general real-world use by providing a complete pipeline for automatic, semantics-aware co-speech gesture generation and synchronized real-time physical deployment on a humanoid robot.

Paper Structure

This paper contains 19 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: System Overview: Training and Inference Pipeline.
  • Figure 2: Comparison of Original vs. Reconstructed G1 Motion (Joint Dimensions 0-5). The reconstructed motion (red dashed line) closely follows the original motion (blue solid line) over time.
  • Figure 3: G1 Motion Comparison: Ground Truth vs. Generated (Selected Joints). The generated motion shows a strong correlation and low MSE with the Ground Truth data.
  • Figure 4: Real-World Deployment Demonstration on Unitree G1. The robot executes generated co-speech gestures while maintaining balance.