EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

Renda Li, Xiaohua Qi, Qiang Ling, Jun Yu, Ziyi Chen, Peng Chang, Mei Han, Jing Xiao

TL;DR

This work tackles efficient audio-driven co-speech gesture video generation by introducing EasyGenNet, a diffusion-based one-stage framework that fine-tunes on a modest amount of data per speaker. It converts audio to a sequence of 2D skeleton maps derived from a SMPLX representation and renders photorealistic video conditioned on a reference image using a frozen Backbone Denoising Network, a fine-tuned ReferenceNet, and a Pose ControlNet, with temporal coherence achieved through All-frames Attention and temporal inference. The method achieves superior hand and gesture realism compared with GAN-based baselines and other diffusion methods, notably under out-of-domain poses, while avoiding large-scale pretraining or dedicated temporal modules. This makes practical deployment feasible for new speakers and applications requiring rapid adaptation with limited data, enabling scalable co-speech video generation in real-world settings.

Abstract

Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. To improve generation quality, previous works adopted complex inputs and training strategies and required large datasets for pre-training, which makes them inconvenient for practical applications. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without additional training of temporal modules. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data per character are needed to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeletons as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.

Paper Structure

This paper contains 25 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given an audio segment, a speaker ID, and a reference image of the speaker, our co-speech gesture video generation framework can produce videos with realistic appearances, vivid facial expressions, and clear hand gestures. Here, we showcase several generated video frames for two characters.
  • Figure 2: Skeleton maps generated with two methods: running the OpenPose model on the ground-truth (GT) images (3rd column), and our method, which extracts SMPLX model parameters and then maps them to joints under the OpenPose configuration (4th column). Our method yields more accurate hand poses, finger configurations, and body shapes.
  • Figure 3: Given an input audio $\mathcal{A}_t$, we first train a network to predict skeleton sequences that match the speech, denoted $\mathcal{P}_t$ for each frame (see the audio-to-gesture section). Additionally, leveraging the pre-trained SD 1.5 model and Pose ControlNet, we design a single-stage fine-tuned video generation model called EasyGenNet. Given a skeleton sequence $\mathcal{P}_t$ and a reference image as conditions, our model generates videos that align with the appearance of the reference image while matching the poses to the skeleton sequence.
  • Figure 4: Qualitative comparison between our method and the GAN baseline. In the in-domain test set (upper half), our method outperforms the baseline by generating clearer hands and more realistic facial expressions. In the out-of-domain test set (lower half), the baseline model fails to adapt to body shifts and new hand positions relative to the training set, resulting in incorrect appearances, while our method consistently generates accurate and high-quality images.
  • Figure 5: Generation results on the original skeleton map (left side) and on the skeleton map after increasing the camera focal length (right side) for both the Baseline method and our method. After the skeleton maps were enlarged, our method not only continued to generate clear hands but also produced correct head and body shapes compared to the Baseline.
  • ...and 1 more figure
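
The captions for Figures 2 and 5 describe mapping 3D joints to a 2D skeleton map and enlarging that map by increasing the camera focal length. The captions do not specify the camera model, so the following is only a minimal sketch assuming a standard pinhole projection; the function name `project_joints`, the joint coordinates, the focal lengths, and the 512x512 image center are all hypothetical values chosen for illustration, not taken from the paper.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_joints(
    joints_3d: List[Point3D],
    focal_length: float,
    principal_point: Point2D,
) -> List[Point2D]:
    """Project camera-space 3D joints to pixels with a pinhole camera (assumed model)."""
    cx, cy = principal_point
    projected = []
    for x, y, z in joints_3d:
        if z <= 0:
            raise ValueError("joint must lie in front of the camera (z > 0)")
        # Pinhole projection: u = f * x / z + cx, v = f * y / z + cy
        projected.append((focal_length * x / z + cx, focal_length * y / z + cy))
    return projected


# Hypothetical joints (meters, camera space): head, wrist, fingertip.
joints = [(0.0, -0.5, 2.0), (0.2, 0.0, 2.0), (0.3, 0.1, 2.0)]
center = (256.0, 256.0)  # assumed center of a 512x512 skeleton map

original = project_joints(joints, focal_length=500.0, principal_point=center)
enlarged = project_joints(joints, focal_length=750.0, principal_point=center)

for before, after in zip(original, enlarged):
    print(before, "->", after)
```

Under this assumed model, scaling the focal length by a factor of 1.5 moves every joint 1.5 times farther from the image center, so the rendered skeleton appears larger while the underlying pose is unchanged. This is the kind of out-of-domain shift in body size and hand position that Figure 5 examines.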