Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo

TL;DR

This work tackles the task of jointly generating co-speech gestures and expressive talking heads from audio using a diffusion model with adapter modules. By introducing cross-modal adapters within a shared transformer backbone, the approach enables face and body to influence each other while sharing latent representations, reducing parameter counts relative to separate networks. It reports state-of-the-art or competitive results in both gesture realism and facial expressiveness, validated through qualitative examples, quantitative metrics, and a user study. The method offers practical benefits for avatar realism in AI chat systems and virtual communication, with avenues for seed-free inference and broader dataset coverage in future work.

Abstract

Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, which increases training complexity and ignores the inherent relationship between face and body movements. To address these challenges, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

Paper Structure

This paper contains 24 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Framework and Architecture Overview. The top left shows the general network architecture: face motion parameters and body parameters are fed into separate projection layers and combined with the conditional information $c_t$ (bottom left) and the noisy latents $x_t$. The transformer network (in green) denoises the latents, which are fed into the network separately. The right shows the architecture of our transformer block with adapters. The green transformer blocks share one set of parameters for both branches.
  • Figure 2: Qualitative Comparison. We compare sequences of motions for our method, TalkSHOW and DiffGesture. Our motions are more diverse and dynamic compared to the baselines.
  • Figure 3: Qualitative Comparison for Ablation Study. Our model produces smooth dynamic motion while the alternative architectures generate jittery motions that move little from the mean position.
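
The weight-sharing idea in the Figure 1 caption (one set of block parameters used by both branches, plus per-modality adapters) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a bottleneck-style adapter (down-projection, nonlinearity, up-projection, residual), and it replaces the transformer block with a single shared matrix. The names `Adapter`, `SharedBlockWithAdapters`, `dim`, and `bottleneck` are all hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random rows x cols matrix stored as a list of row lists."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a rows x cols matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, tanh, up-project, residual."""

    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, rng)
        self.up = make_matrix(dim, bottleneck, rng)

    def __call__(self, hidden):
        squeezed = [math.tanh(x) for x in matvec(self.down, hidden)]
        delta = matvec(self.up, squeezed)
        return [h + d for h, d in zip(hidden, delta)]

    def num_params(self):
        return 2 * len(self.down) * len(self.down[0])


class SharedBlockWithAdapters:
    """One shared weight matrix (a stand-in for a transformer block, by
    assumption) plus one small adapter per modality."""

    def __init__(self, dim, bottleneck, modalities, seed=0):
        rng = random.Random(seed)
        self.shared = make_matrix(dim, dim, rng)
        self.adapters = {name: Adapter(dim, bottleneck, rng) for name in modalities}

    def forward(self, hidden, modality):
        # Both branches pass through the same shared weights...
        shared_out = [math.tanh(x) for x in matvec(self.shared, hidden)]
        # ...and only the adapter differs between face and body.
        return self.adapters[modality](shared_out)

    def num_params(self):
        dim = len(self.shared)
        return dim * dim + sum(a.num_params() for a in self.adapters.values())


block = SharedBlockWithAdapters(dim=16, bottleneck=2, modalities=("face", "body"))
latent = [0.5] * 16
face_out = block.forward(latent, "face")
body_out = block.forward(latent, "body")

# Toy parameter comparison against two fully separate 16 x 16 blocks.
shared_total = block.num_params()
separate_total = 2 * 16 * 16
```

With these toy sizes, the shared design has 16·16 + 2·(2·2·16) = 384 parameters against 512 for two separate blocks. The numbers only illustrate how the saving arises and say nothing about the paper's actual parameter counts.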