VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan

TL;DR

VarGes tackles the limited diversity of co-speech 3D gesture generation by integrating visual stylistic cues from style-reference videos into an audio-driven pipeline. It introduces three components—VEFE for enriched feature extraction, VCSE for robust style encoding, and VDGP for cross-attentive, autoregressive gesture prediction with VQ-VAE quantization. Through experiments on the SHOW dataset, VarGes achieves superior gesture variation and realism while maintaining strong audio-gesture synchronization, outperforming state-of-the-art methods and demonstrating the value of style-conditioned multimodal fusion. The approach enhances naturalness and expressiveness of animated characters, with potential impact on HCI, VR, and animation pipelines.

Abstract

Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they are often limited by constrained dataset diversity and by the restricted amount of information that can be derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which feeds style-reference video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive-attention pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention and injects the fused features into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic cues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance: https://github.com/mookerr/VarGES/.
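To make the pooling step in the VCSE description more concrete, the following is a minimal sketch of additive attention pooling, which collapses a sequence of per-frame features into a single style code. It is not the authors' implementation: the function name `additive_attention_pool`, the parameters `W`, `b`, and `v`, and the toy dimensions are all assumptions chosen for illustration, and a real encoder would learn these parameters and run on tensors rather than Python lists.

```python
import math


def additive_attention_pool(frames, W, b, v):
    """Pool T frame vectors (each of size D) into a single D-dim vector.

    score_t = v . tanh(W h_t + b)      (additive attention score)
    alpha   = softmax(score) over t
    pooled  = sum_t alpha_t * h_t
    W is H x D; b and v have length H. All parameters are illustrative.
    """
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(vi * zi for vi, zi in zip(v, hidden)))

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the original frame features.
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Three hypothetical 2-D frame features with a hidden size of 2.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    W = [[0.5, -0.5], [0.25, 0.75]]
    b = [0.0, 0.1]
    v = [1.0, -1.0]
    style_code, alpha = additive_attention_pool(frames, W, b, v)
    print(style_code, alpha)
```

The pooled vector always has the same dimension as a single frame, regardless of how many frames the reference clip contains, which is what lets clips of different lengths map to style codes of one fixed size.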

Paper Structure

This paper contains 22 sections, 11 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Example video frames of gestures generated by our method. Compared with TalkShow, our method produces a richer variety of character gestures, and the gestures are more natural and smooth.
  • Figure 2: Overview of the VarGes framework. VarGes comprises three modules. The Variation-Enhanced Feature Extraction (VEFE) module extracts key features from speech using Wav2vec 2.0 and MFCC, filtering noise with StyleCLIPS from style-reference videos. The Variation-Compensation Style Encoder (VCSE) module encodes StyleCLIPS into deep-feature style codes with a transformer-based encoder and self-attention pooling. The Variation-Driven Gesture Predictor (VDGP) module fuses style codes and MFCC features through cross-attention and a temporal autoregressive network, incorporating identity information to boost gesture diversity and naturalness. Action quantization is applied during training to further increase action variability.
  • Figure 3: t-SNE visualization of the style code distribution. The figure shows the style codes learned from videos associated with four different IDs.
  • Figure 4: Visualization of the same audio with different style-reference videos. The figure shows the gesture generation results of our model when it is given identical audio input and different style-reference videos. The generated gestures stay synchronized with the audio while adapting to the distinct stylistic characteristics of each reference video, demonstrating the model's ability to achieve both diversity and naturalness in gesture generation.
  • Figure 5: Ground truth comparison and 3D mesh visualization with StyleCLIPS. A side-by-side comparison between the original video frames and the corresponding 3D meshes generated using StyleCLIPS.
  • ...and 3 more figures
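The cross-attention fusion named in the Figure 2 caption can be sketched as plain scaled dot-product attention. This is a simplified illustration under stated assumptions, not the VDGP module itself: the function name `cross_attention` is hypothetical, the learned query, key, and value projections of a real attention layer are omitted, and treating audio frames as queries over style tokens is an assumption, since the text here does not say which modality plays which role.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: Tq vectors of size D (assumed here: per-frame audio features)
    keys:    Tk vectors of size D (assumed here: style tokens)
    values:  Tk vectors of size Dv (assumed here: style tokens)
    Returns Tq fused vectors of size Dv.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]
        )
    return fused


if __name__ == "__main__":
    # Two hypothetical audio frames attending over two hypothetical style tokens.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    style_keys = [[1.0, 0.0], [0.0, 1.0]]
    style_values = [[2.0, 4.0], [4.0, 8.0]]
    print(cross_attention(audio, style_keys, style_values))
```

The output has one fused vector per audio frame, so the result stays aligned with the audio timeline and could be passed frame by frame to a downstream autoregressive predictor.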