Semantic Co-Speech Gesture Synthesis and Real-Time Control for Humanoid Robots
Gang Zhang
TL;DR
The paper tackles the challenge of generating semantically meaningful co-speech gestures and deploying them in real time on humanoid robots. It presents an end-to-end pipeline combining General Motion Retargeting, a Residual VQ-VAE for discrete motion tokens, an autoregressive Motion-GPT conditioned on audio, and an imitation-learning MotionTracker controller for real-time execution. A key novelty is the integration of LLM-based retrieval to augment semantic gesture candidates with semantics-aware alignment, improving the naturalness and meaning of gestures. Extensive experiments validate high-fidelity motion encoding, gesture generation quality, and successful real-world deployment on the Unitree G1, indicating practical viability for expressive human-robot interaction.
Abstract
We present an end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real time on a humanoid robot. The system addresses the challenge of giving robots natural, expressive non-verbal communication by integrating gesture generation with physical control. Its core is a semantics-aware gesture synthesis module that derives expressive reference motions from speech input, combining a generative retrieval mechanism based on large language models (LLMs) with an autoregressive Motion-GPT model. This module is coupled with a high-fidelity imitation-learning control policy, the MotionTracker, which enables the Unitree G1 humanoid robot to execute these motions dynamically while maintaining balance. To ensure feasibility, we employ General Motion Retargeting (GMR) to bridge the embodiment gap between human motion data and the robot platform. Through comprehensive evaluation, we demonstrate that the combined system produces semantically appropriate and rhythmically coherent gestures that are accurately tracked and executed by the physical robot. To our knowledge, this work is a significant step toward real-world use, providing a complete pipeline for automatic, semantics-aware co-speech gesture generation and synchronized real-time physical deployment on a humanoid robot.
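As an illustration of the discrete motion-token stage named in the TL;DR (the Residual VQ-VAE), the sketch below shows residual vector quantization in plain Python. It is a minimal sketch, not the paper's implementation: the toy 2-D codebooks, the input vector, and the function names nearest_code, residual_quantize, and reconstruct are all assumptions made for illustration. A real Residual VQ-VAE would learn its codebooks jointly with an encoder and decoder over motion features.

```python
# Illustrative residual vector quantization (RVQ) on a single feature vector.
# Codebooks and values are hypothetical toy data, not taken from the paper.

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen code so the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def reconstruct(tokens, codebooks):
    """Sum the selected code from every level to approximate the input."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two hypothetical codebooks: a coarse level and a finer correction level.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
vec = [1.2, 0.9]

tokens, residual = residual_quantize(vec, codebooks)
approx = reconstruct(tokens, codebooks)
print(tokens, approx)
```

In this toy setup each additional codebook refines whatever error the previous level left behind. The resulting list of indices is the kind of discrete token sequence that an autoregressive model such as the Motion-GPT described above could be trained to predict.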
