MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation
Siyuan Wang, Jiawei Liu, Wei Wang, Yeying Jin, Jinsong Du, Zhi Han
TL;DR
Co-speech gesture video generation from audio-driven stills is challenging due to varying regional motions and the need for synchronized lip movements. The paper introduces MMGT, a two-stage framework that uses audio-derived motion masks and motion features to jointly drive gesture video synthesis without relying on extra priors. Stage I (SMGA) generates a pose video $\hat{V}_P$ and a motion mask $\hat{V}_M$ from audio $a$ and an initial pose, while Stage II embeds Motion Masked Hierarchical Audio Attention (MM-HAA) inside a Stabilized Diffusion framework to produce a high-quality final video $\hat{V}$ with accurate texture and regional details. Experiments show improved video quality, lip-sync, and gesture realism compared with prior methods, demonstrating practical gains from using only audio and a single reference image.
Abstract
Co-Speech Gesture Video Generation aims to generate vivid speech videos from audio-driven still images, which is challenging because different parts of the body differ in motion amplitude, audio relevance, and detailed features. Relying solely on audio as the control signal often fails to capture large gesture movements in video, leading to more pronounced artifacts and distortions. Existing approaches typically address this issue by introducing additional a priori information, but this can limit the practical application of the task. To avoid this dependence, we propose a Motion Mask-Guided Two-Stage Network (MMGT) that uses audio, together with motion masks and motion features generated from the audio signal, to jointly drive the generation of synchronized speech gesture videos. In the first stage, the Spatial Mask-Guided Audio Pose Generation (SMGA) Network generates high-quality pose videos and motion masks from audio, effectively capturing large movements in key regions such as the face and gestures. In the second stage, we integrate the Motion Masked Hierarchical Audio Attention (MM-HAA) into the Stabilized Diffusion Video Generation model, overcoming limitations in fine-grained motion generation and region-specific detail control found in traditional methods. This guarantees high-quality, detailed upper-body video generation with accurate texture and motion details. Evaluations show improved video quality, lip-sync, and gesture realism compared with prior methods. The model and code are available at https://github.com/SIA-IDE/MMGT.
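The two-stage data flow described above can be illustrated with a toy sketch. It is not the released MMGT code: the function names `stage1` and `stage2`, the tensor shapes, and the arithmetic inside each function are hypothetical stand-ins. They were chosen only to show audio and an initial pose producing a pose video and a motion mask, and the mask then restricting where the audio signal modifies the reference image. The real SMGA and MM-HAA modules are learned networks and are not reproduced here.

```python
import numpy as np


def stage1(audio, init_pose, height=8, width=8):
    """Toy stand-in for Stage I: audio + initial pose -> pose video, motion mask.

    audio:     (T, D) per-frame audio features (hypothetical layout)
    init_pose: (K, 2) keypoints as (x, y) in [0, 1]
    """
    num_frames = audio.shape[0]
    energy = np.abs(audio).mean(axis=1)  # (T,) crude per-frame loudness

    # Pose video: copy the initial pose to every frame, then nudge y by loudness.
    pose_video = np.repeat(init_pose[None], num_frames, axis=0).astype(np.float32)
    pose_video[..., 1] += 0.05 * energy[:, None]
    pose_video = np.clip(pose_video, 0.0, 1.0)

    # Motion mask: mark the grid cell that contains each keypoint in each frame.
    mask = np.zeros((num_frames, height, width), dtype=np.float32)
    rows = np.minimum((pose_video[..., 1] * height).astype(int), height - 1)
    cols = np.minimum((pose_video[..., 0] * width).astype(int), width - 1)
    for t in range(num_frames):
        mask[t, rows[t], cols[t]] = 1.0
    return pose_video, mask


def stage2(reference_image, audio, pose_video, motion_mask):
    """Toy stand-in for Stage II: the mask gates where audio changes the frame."""
    del pose_video  # accepted to mirror the described interface; unused in this toy
    num_frames = audio.shape[0]
    energy = np.abs(audio).mean(axis=1)  # (T,)

    frames = np.repeat(reference_image[None], num_frames, axis=0).astype(np.float32)
    # Audio-driven update, applied only inside the masked regions.
    update = energy[:, None, None, None] * motion_mask[..., None]
    return np.clip(frames + 0.1 * update, 0.0, 1.0)


rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 16))          # 4 frames of 16-dim audio features
init_pose = rng.uniform(size=(5, 2))          # 5 keypoints
reference_image = rng.uniform(size=(8, 8, 3)) # single reference image

pose_video, motion_mask = stage1(audio, init_pose)
video = stage2(reference_image, audio, pose_video, motion_mask)
print(pose_video.shape, motion_mask.shape, video.shape)
```

In this sketch, pixels outside the mask are copied unchanged from the reference image, which is one simple way to picture region-specific control. How MM-HAA actually applies the mask inside the diffusion model is not specified in this section.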
