Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters
Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
TL;DR
This work tackles the task of jointly generating co-speech gestures and expressive talking heads from audio using a diffusion model with adapter modules. By introducing cross-modal adapters within a shared transformer backbone, the approach enables face and body to influence each other while sharing latent representations, reducing parameter counts relative to separate networks. It reports state-of-the-art or competitive results in both gesture realism and facial expressiveness, validated through qualitative examples, quantitative metrics, and a user study. The method offers practical benefits for avatar realism in AI chat systems and virtual communication, with avenues for seed-free inference and broader dataset coverage in future work.
Abstract
Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, which increases training complexity and ignores the inherent relationship between face and body movements. To address these challenges, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach shares weights between the two modalities, with adapters that map each modality into a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.
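To make the parameter-saving argument concrete, the following back-of-the-envelope sketch compares two separate transformer stacks (one for face, one for body) against a single shared stack with one small adapter per modality in each layer. Everything in it is an assumption for illustration: the layer sizes, the generic bottleneck-adapter design, and the helper names (`transformer_layer_params`, `adapter_params`, `separate_networks`, `shared_with_adapters`) are hypothetical and are not taken from the paper, which does not specify these details here.

```python
# Illustrative parameter counting only. All sizes are hypothetical and the
# adapter is a generic bottleneck adapter, not necessarily the paper's design.


def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Parameters in one standard transformer encoder layer (weights + biases)."""
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers
    norms = 2 * (2 * d_model)                  # two LayerNorms (scale + shift)
    return attn + ffn + norms


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameters in a bottleneck adapter: down-projection, then up-projection."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def separate_networks(n_layers: int, d_model: int, d_ff: int) -> int:
    """Two independent stacks, one per modality (face and body)."""
    return 2 * n_layers * transformer_layer_params(d_model, d_ff)


def shared_with_adapters(n_layers: int, d_model: int, d_ff: int, bottleneck: int) -> int:
    """One shared stack plus one adapter per modality in every layer."""
    per_layer = transformer_layer_params(d_model, d_ff) + 2 * adapter_params(d_model, bottleneck)
    return n_layers * per_layer


if __name__ == "__main__":
    # Hypothetical configuration chosen only to show the trend.
    cfg = dict(n_layers=8, d_model=512, d_ff=2048)
    sep = separate_networks(**cfg)
    shared = shared_with_adapters(bottleneck=64, **cfg)
    print(f"separate: {sep:,}  shared+adapters: {shared:,}  ratio: {shared / sep:.2f}")
```

Under these assumed sizes the adapters add only a few percent to each shared layer, so the joint model lands at roughly half the size of two separate stacks. The actual savings reported in the paper depend on its real architecture and are not reproduced by this sketch.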
