AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li

TL;DR

AsynFusion tackles the lack of coordination between audio-driven facial expressions and body gestures with a dual-branch diffusion-transformer framework. Its Cooperative Synchronization Module provides bidirectional interaction between the two modalities, while asynchronous Latent Consistency Model sampling accelerates inference, allowing real-time, synchronized whole-body avatar animation. On the BEAT and SHOW benchmarks, AsynFusion reports state-of-the-art results for motion quality, diversity, and Beat Alignment, indicating stronger coherence between expressions and gestures. The work highlights the practical potential of asynchronous, bidirectional diffusion for lifelike digital humans and points to data quality and large-scale multimodal pretraining as key avenues for future improvement.

Abstract

Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.
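
To make "bidirectional feature interaction between the two modalities" concrete, the following is a minimal sketch in which each branch attends to the other branch's features and adds the result as a residual. It is an illustration only, not the Cooperative Synchronization Module as implemented in the paper: the function names (`cross_attend`, `cosync_exchange`), the single-head attention without learned projections, and the feature shapes are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product attention (no learned projections).

    queries: (T_q, d) features of the branch being updated.
    context: (T_c, d) features of the other branch.
    Returns a (T_q, d) array: for each query row, a weighted mix of context rows.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def cosync_exchange(f_exp, f_ges):
    """Hypothetical bidirectional exchange between two branches.

    Both updates read the ORIGINAL features of the other branch, so the
    exchange is symmetric; each result is added back as a residual.
    """
    new_exp = f_exp + cross_attend(f_exp, f_ges)
    new_ges = f_ges + cross_attend(f_ges, f_exp)
    return new_exp, new_ges


# Toy features: 8 expression tokens and 12 gesture tokens, width 16 (assumed sizes).
rng = np.random.default_rng(0)
f_exp = rng.standard_normal((8, 16))
f_ges = rng.standard_normal((12, 16))
out_exp, out_ges = cosync_exchange(f_exp, f_ges)
```

Each branch keeps its own sequence length, so the two streams can be generated in parallel while still conditioning on one another at every exchange point. A real module would typically add learned query, key, and value projections and normalization layers.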

Paper Structure

This paper contains 18 sections, 19 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of Different Audio-Driven Avatar Generation Frameworks. The upper section presents the three mainstream frameworks, while the lower section introduces our proposed AsynFusion which enables bidirectional feature interaction between the face and body generators and supports asynchronous sampling for more efficient generation.
  • Figure 2: Overview of AsynFusion. The framework (a) consists of a Dual-branch DiT architecture (blue and green) with a CoSync module for bidirectional feature interaction between expression and gesture branches, utilizing $F_i^{exp}$ and $F_i^{ges}$. (b) is the inference scheduler of AsynFusion.
  • Figure 3: Visualization of generated motions for the speech. The red arrows indicate how the gestures and facial expressions are well-coordinated during the greeting motion.
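
The captions for Figures 1 and 2(b) mention asynchronous sampling but do not spell out the schedule. As a rough intuition for how asynchrony can reduce compute, the sketch below shows a generic staggered-update loop in which one branch is refreshed at every step and the other only every `slow_every` steps, with the stale state reused in between. This is not the paper's inference scheduler: `staggered_schedule`, `run_staggered`, the `slow_every` parameter, and the toy step functions are hypothetical, and which branch (if any) is updated less often in AsynFusion is not stated here.

```python
def staggered_schedule(num_steps, slow_every):
    """Return one (run_fast, run_slow) flag pair per sampling step.

    The fast branch runs at every step; the slow branch runs only on
    steps that are multiples of slow_every (assumed policy).
    """
    if num_steps < 1 or slow_every < 1:
        raise ValueError("num_steps and slow_every must be positive")
    return [(True, step % slow_every == 0) for step in range(num_steps)]


def run_staggered(num_steps, slow_every, fast_step, slow_step,
                  fast_state, slow_state):
    """Run both branches under the staggered schedule.

    fast_step(fast_state, slow_state) -> new fast_state
    slow_step(slow_state, fast_state) -> new slow_state
    Returns the final states and the number of slow-branch evaluations.
    """
    slow_calls = 0
    for _, run_slow in staggered_schedule(num_steps, slow_every):
        if run_slow:
            slow_state = slow_step(slow_state, fast_state)
            slow_calls += 1
        # The fast branch always runs and sees the most recent slow state,
        # which may be a cached value from an earlier step.
        fast_state = fast_step(fast_state, slow_state)
    return fast_state, slow_state, slow_calls


# Toy stand-ins for the two denoisers: simple counters instead of networks.
final_fast, final_slow, slow_calls = run_staggered(
    num_steps=4,
    slow_every=2,
    fast_step=lambda fast, slow: fast + 1,
    slow_step=lambda slow, fast: slow + 10,
    fast_state=0,
    slow_state=0,
)
```

With four steps and `slow_every=2`, the slow branch is evaluated twice instead of four times, which is where the saving comes from in this simplified picture.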