Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Yafei Song, Peng Zhang, Bang Zhang

TL;DR

This work tackles co-speech gesture video generation by coupling a diffusion-based gesture generator with a motion-graph retrieval framework. The diffusion model learns the joint audio-motion distribution from both low-level and high-level audio features; the generated motion is then used to retrieve video segments from a pre-constructed motion graph, which are stitched into a coherent video. Quantitative and qualitative results show improved synchronization and gesture naturalness over prior approaches, highlighting the benefit of combining diffusion priors with graph-based retrieval. Limitations include reliance on motion-graph data and occasional transition artifacts; future work aims at few-shot graph adaptation and improved transition modeling.

Abstract

Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

Paper Structure

This paper contains 13 sections, 18 equations, 3 figures, 2 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Our key idea is to generate the motion sequence conditioned on the input audio using a diffusion model and then retrieve its nearest trajectory from the pre-constructed motion graph.
  • Figure 2: The pipeline of our method. We first train a transformer-based denoising network to generate motion conditioned on the audio, then construct a motion graph from the existing video data. Using a hybrid motion similarity metric, we retrieve the optimal trajectory from the motion graph. Stitching together the nodes along this trajectory yields the final video.
  • Figure 3: Qualitative comparison showing our method's ability to generate context-specific gestures (e.g., raised arms during emphatic speech). Because still images convey little temporal information, please refer to the videos in the supplement.
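
To make the retrieval step in Figure 2 more concrete, here is a minimal sketch of how a generated motion sequence might be matched against a motion graph. It is not the authors' algorithm. The weighted global/local distance, the `w_global` weight, the data layout (segments as lists of pose vectors), and the function names `hybrid_distance` and `retrieve_path` are all assumptions made for illustration, and the search is a standard Viterbi-style dynamic program that may differ from the paper's retrieval procedure.

```python
import math


def l2(a, b):
    """Euclidean distance between two pose vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pose(segment):
    """Average pose over all frames of a segment."""
    n = len(segment)
    return [sum(frame[d] for frame in segment) / n for d in range(len(segment[0]))]


def hybrid_distance(query, clip, w_global=0.5):
    """Assumed hybrid metric: weighted sum of a global and a local term.

    Both segments are lists of pose vectors with the same number of frames.
    """
    # Global term: distance between the average poses of the two segments.
    global_term = l2(mean_pose(query), mean_pose(clip))
    # Local term: mean frame-by-frame pose distance.
    local_term = sum(l2(q, c) for q, c in zip(query, clip)) / len(query)
    return w_global * global_term + (1.0 - w_global) * local_term


def retrieve_path(queries, nodes, edges):
    """Find the cheapest node path through the graph (Viterbi-style search).

    queries: generated motion segments, one per time step
    nodes:   dict mapping node id -> motion segment of that video clip
    edges:   dict mapping node id -> set of allowed successor node ids
    """
    ids = list(nodes)
    # Cost of starting the path at each node.
    cost = {n: hybrid_distance(queries[0], nodes[n]) for n in ids}
    back_pointers = []
    for query in queries[1:]:
        new_cost, pointers = {}, {}
        for n in ids:
            # Predecessors that are still reachable and have an edge into n.
            preds = [p for p in cost if n in edges.get(p, ())]
            if not preds:
                continue
            best = min(preds, key=lambda p: cost[p])
            new_cost[n] = cost[best] + hybrid_distance(query, nodes[n])
            pointers[n] = best
        cost = new_cost
        back_pointers.append(pointers)
    if not cost:
        raise ValueError("no valid path of the requested length in the graph")
    # Backtrack from the cheapest final node.
    path = [min(cost, key=cost.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example: three clips with 1-D poses, two frames each.
nodes = {"a": [[0.0], [0.0]], "b": [[1.0], [1.0]], "c": [[2.0], [2.0]]}
edges = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
queries = [[[0.0], [0.0]], [[1.0], [1.0]], [[2.0], [2.0]]]
path = retrieve_path(queries, nodes, edges)
print(path)
```

In this sketch every consecutive pair of nodes on the returned path is connected by a graph edge. A real pipeline would additionally have to blend frames wherever two consecutive nodes are not adjacent in the source video, which is the stitching step mentioned in the abstract.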