ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

Yongkang Cheng, Mingjiang Liang, Shaoli Huang, Jifeng Ning, Wei Liu

TL;DR

This work introduces ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures, and designs a noise emotion classifier for optimizing adversarial direction noise, avoiding melody distortion and guiding results towards specified emotions.

Abstract

Existing gesture generation methods primarily focus on upper-body gestures based on audio features, neglecting speech content, emotion, and locomotion. These limitations result in stiff, mechanical gestures that fail to convey the true meaning of the audio content. We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures. Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise, avoiding melody distortion and guiding results towards specified emotions. Moreover, aligning semantics and gestures in the latent space provides better generalization capabilities. ExpGest, a diffusion model-based gesture generation framework, is the first attempt to offer mixed generation modes, including audio-driven gestures and text-shaped motion. Experiments show that our framework effectively learns from combined text-driven motion and audio-induced gesture datasets, and preliminary results demonstrate that ExpGest achieves more expressive, natural, and controllable global motion in speakers than state-of-the-art models.
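
The latent-space alignment of semantics and gestures mentioned in the abstract is described in the Figure 3 caption as contrastive learning. The following is a minimal sketch of one common contrastive objective, a symmetric InfoNCE loss over paired embeddings. The choice of InfoNCE, the function names, and the temperature value are assumptions made for illustration; the source does not state which contrastive loss ExpGest uses.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(gesture_emb, semantic_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Assumption: row i of gesture_emb and row i of semantic_emb form a
    matching pair; every other row in the batch acts as a negative.
    """
    g = [l2_normalize(v) for v in gesture_emb]
    s = [l2_normalize(v) for v in semantic_emb]
    n = len(g)
    # Cosine similarities between every gesture and every semantic embedding,
    # divided by the temperature.
    logits = [
        [sum(a * b for a, b in zip(g[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the gesture-to-semantic and semantic-to-gesture directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))
```

Under this loss, matching pairs are pulled together and mismatched pairs are pushed apart in the shared space, so a batch whose pairs are correctly matched scores a lower loss than the same batch with its pairs swapped.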

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Our method demonstrates multimodal-driven effects. The first row showcases the emotional control capability of gestures driven by audio alone, while the second row exhibits the motion style transferability driven by a combination of phrases and audio. The third row presents the results of long-frame textual descriptions and audio jointly driving the process.
  • Figure 2: Architecture diagram. The upper part is the denoising model GDM: the noise step $T$, pure Gaussian noise, and the conditions (text description and audio) are fed into the model as input sequences. The lower part is the sampling step: we predict $x'_{0}$ through the denoising process, then add noise via the diffusion process to obtain $x_{t-1}$. Subsequently, $x_{t-1}$ is input into the noise emotion classifier to optimize the noise at that step, and the optimized noise is passed back to the GDM. This cycle repeats as $t$ decreases from $T$ to $0$.
  • Figure 3: Semantic Alignment Module. We employ contrastive learning to encode gestures and audio semantics into a shared latent space, where the two are aligned.
  • Figure 4: The first row compares our method's generative performance in an audio-guided scenario with state-of-the-art techniques; our method yields more expressive gestures. The second row shows speaker generation with rich movements, guided by audio combined with action or text. See the supplementary material for more videos.
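
The sampling cycle in the Figure 2 caption (predict $x'_{0}$, diffuse it to $x_{t-1}$, adjust that step with the noise emotion classifier, repeat) can be sketched as a toy loop. This is a sketch under stated assumptions, not the authors' implementation: `toy_denoiser` and `toy_emotion_grad` are hypothetical stand-ins for the GDM and the classifier gradient, the linear beta schedule and `guidance_scale` are illustrative choices, and the update shown is a generic classifier-guidance nudge rather than the paper's exact noise optimization.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal levels; index 0 is the clean sample (no noise)."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def toy_denoiser(x_t, t, condition):
    """Hypothetical stand-in for the GDM: predicts a clean sample x0'.

    It simply moves halfway from the current sample to the condition.
    """
    return [0.5 * (xi + ci) for xi, ci in zip(x_t, condition)]


def toy_emotion_grad(x, emotion_target):
    """Hypothetical stand-in for the classifier gradient.

    Gradient of log p(target | x) for a unit-variance Gaussian centred on
    emotion_target, so it points from x towards the target.
    """
    return [ti - xi for xi, ti in zip(x, emotion_target)]


def sample(condition, emotion_target, num_steps=50, guidance_scale=0.1, seed=0):
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in condition]  # x_T: pure Gaussian noise
    for t in range(num_steps, 0, -1):
        # 1. Denoising: predict the clean sample x0' from x_t.
        x0_pred = toy_denoiser(x, t, condition)
        # 2. Diffusion: re-noise x0' to step t-1 (no noise is added at t-1 = 0).
        ab_prev = alpha_bars[t - 1]
        sigma = math.sqrt(1.0 - ab_prev)
        x = [math.sqrt(ab_prev) * x0 + sigma * rng.gauss(0.0, 1.0) for x0 in x0_pred]
        # 3. Guidance: nudge x_{t-1} along the classifier gradient.
        grad = toy_emotion_grad(x, emotion_target)
        x = [xi + guidance_scale * gi for xi, gi in zip(x, grad)]
    return x
```

Calling `sample` twice with the same seed, once with `guidance_scale=0.0` and once with a positive value, isolates the effect of the guidance step: in this toy setup the guided result lands closer to `emotion_target`.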