SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models

Qingrong Cheng, Xu Li, Xinghui Fu, Fei Xia, Zhongqian Sun

TL;DR

This work introduces SIGGesture, a novel diffusion-based approach for synthesizing realistic gestures that are both high-quality and semantically pertinent, and proposes a semantic injection module to infuse semantic information into the synthesized results during the diffusion reverse process.

Abstract

The automated synthesis of high-quality 3D gestures from speech is of significant value in virtual humans and gaming. Previous methods focus on synthesizing gestures that are synchronized with speech rhythm, yet they frequently overlook semantic gestures. These are sparse and follow a long-tailed distribution across the gesture sequence, making them difficult to learn in an end-to-end manner. Moreover, methods that generate gestures rhythmically aligned with speech often fail to generalize to in-the-wild speech. To address these issues, we introduce SIGGesture, a novel diffusion-based approach for synthesizing realistic gestures that are both high-quality and semantically pertinent. Specifically, we first build a strong diffusion-based foundation model for rhythmical gesture synthesis by pre-training it on a collected large-scale dataset with pseudo labels. Second, we leverage the powerful generalization capabilities of Large Language Models (LLMs) to generate appropriate semantic gestures for varied speech content. Finally, we propose a semantic injection module to infuse semantic information into the synthesized results during the reverse diffusion process. Extensive experiments demonstrate that the proposed SIGGesture significantly outperforms existing baselines and shows excellent generalization and controllability.
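
The abstract does not spell out how the semantic injection module operates. As a rough illustration only, the sketch below uses inpainting-style replacement, a common way to impose known content during reverse diffusion: at every step, the frames reserved for a semantic gesture are overwritten with that gesture noised to the current level. This is a stand-in technique and may differ from the authors' module. All names (`make_alpha_bars`, `toy_denoiser`, `sample_with_injection`), the noise schedule, and the one-dimensional "gesture" values are hypothetical.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def toy_denoiser(x_t, t):
    """Hypothetical stand-in for the pre-trained denoising network.

    It predicts the clean sequence as a 3-frame moving average of x_t.
    A real model would also condition on audio features and speaker identity.
    """
    n = len(x_t)
    x0_hat = []
    for i in range(n):
        window = x_t[max(0, i - 1):min(n, i + 2)]
        x0_hat.append(sum(window) / len(window))
    return x0_hat


def sample_with_injection(num_frames, semantic_clip, start, num_steps=50, seed=0):
    """Deterministic (DDIM-style) reverse process with inpainting-style injection."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]  # start from pure noise

    for t in range(num_steps - 1, -1, -1):
        x0_hat = toy_denoiser(x, t)
        if t == 0:
            x = x0_hat  # last step: keep the clean prediction
        else:
            ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
            # Noise implied by the current sample and the clean prediction.
            eps = [(xi - math.sqrt(ab_t) * x0i) / math.sqrt(1.0 - ab_t)
                   for xi, x0i in zip(x, x0_hat)]
            x = [math.sqrt(ab_prev) * x0i + math.sqrt(1.0 - ab_prev) * ei
                 for x0i, ei in zip(x0_hat, eps)]

        # Injection: overwrite the reserved frames with the semantic clip,
        # noised to the level of the current step (no noise at the final step).
        for offset, target in enumerate(semantic_clip):
            if t == 0:
                x[start + offset] = target
            else:
                ab_prev = alpha_bars[t - 1]
                x[start + offset] = (math.sqrt(ab_prev) * target
                                     + math.sqrt(1.0 - ab_prev) * rng.gauss(0.0, 1.0))
    return x


clip = [0.2, 0.8, 1.0, 0.8, 0.2]  # made-up joint-angle trajectory for one gesture
motion = sample_with_injection(num_frames=20, semantic_clip=clip, start=8)
```

In this toy setup the injected frames end exactly at the semantic clip, while neighbouring frames are denoised alongside the injected ones at every step, which is the general motivation for injecting during the reverse process rather than pasting the clip in afterwards.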

Paper Structure (31 sections, 16 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: The framework of the proposed method, illustrated in two parts. The upper part details the diffusion and denoising processes used in training. The lower part shows the inference process for synthesizing gestures from the given conditions. Specifically, the audio features and speaker identity are fed directly into the denoising network. Meanwhile, the textual content of the speech is used as input to the LLM, which generates a set of candidate semantic gestures for the subsequent semantic injection process.
  • Figure 2: Statistics of the semantic gesture dataset (upper left), examples of semantic labels (lower left), and examples of semantic gestures (right).
  • Figure 3: Visual comparison between the proposed method and other state-of-the-art methods.
  • Figure 4: Visual results of the proposed method. The gestures in the red box are semantic gestures, labeled by the content in the green box.
  • Figure 5: Visual results of the proposed method on in-the-wild speech (English, Chinese, and Japanese). The gestures in the red box are semantic gestures, labeled by the content in the green box.
  • ...and 3 more figures