Co-speech Gesture Video Generation via Motion-Based Graph Retrieval
Yafei Song, Peng Zhang, Bang Zhang
TL;DR
This work tackles co-speech gesture video generation by coupling a diffusion-based gesture generator with a motion-graph retrieval framework. It learns a joint audio-motion distribution using multi-modal audio cues and then retrieves and stitches video segments from a pre-constructed motion graph to form coherent gestures. Quantitative and qualitative results demonstrate state-of-the-art synchronization and visual quality, highlighting the benefits of integrating diffusion priors with graph-based retrieval. Limitations include reliance on motion-graph data and occasional transition artifacts, with future work aiming at few-shot graph adaptation and improved transition modeling.
Abstract
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either use the distance between features extracted from the input audio and those associated with the motions in the graph, or embed both the input audio and the motion into a shared feature space. However, these techniques may not be optimal because the relationship between audio and gestures is many-to-many, which a one-to-one mapping cannot adequately capture. To alleviate this limitation, we propose a novel framework that first employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a motion-based retrieval algorithm identifies the most suitable path within the graph by assessing both global and local similarities in motion. Because not all nodes in the retrieved path are sequentially continuous, the final step stitches these segments together to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
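To make the retrieval step concrete, the following is a minimal sketch of one way a path could be selected from a motion graph by comparing generated motion against stored motion. It is not the authors' algorithm. The abstract does not define the global and local similarity terms, so the sketch assumes that "local" means frame-by-frame pose distance within a window and "global" means distance between window-averaged poses. The function name `retrieve_path`, the parameters `w_global` and `jump_penalty`, and the array layouts are all hypothetical.

```python
import numpy as np

def retrieve_path(query, nodes, w_global=0.5, jump_penalty=1.0):
    """Pick one graph node per query window with Viterbi-style dynamic programming.

    query: (T, W, D) generated motion, T windows of W frames, D pose dimensions.
    nodes: (N, W, D) motion stored at N graph nodes, in source-video order.
    Returns a list of T node indices.
    All similarity terms here are illustrative assumptions.
    """
    # Assumed local term: mean frame-by-frame pose distance within a window.
    local = np.linalg.norm(query[:, None] - nodes[None], axis=-1).mean(axis=-1)
    # Assumed global term: distance between window-averaged poses.
    glob = np.linalg.norm(
        query.mean(axis=1)[:, None] - nodes.mean(axis=1)[None], axis=-1
    )
    unary = (1.0 - w_global) * local + w_global * glob  # shape (T, N)

    T, N = unary.shape
    # Transition cost: free when node j directly follows node i in the source
    # video, otherwise a fixed penalty because the cut will need stitching.
    trans = np.full((N, N), jump_penalty)
    trans[np.arange(N - 1), np.arange(1, N)] = 0.0

    cost = unary[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + trans  # rows: previous node, columns: current node
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[t]

    # Trace the cheapest path backwards from the best final node.
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In this sketch, consecutive indices in the returned path correspond to segments that play back continuously, while any non-consecutive pair marks a cut that a separate stitching step would have to smooth. Raising `jump_penalty` trades motion fidelity for fewer such cuts.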
