FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

Tianyun Zhong, Chao Liang, Jianwen Jiang, Gaojie Lin, Jiaqi Yang, Zhou Zhao

TL;DR

FADA tackles the slow inference of diffusion-based audio-driven talking avatars with a mixed-supervised distillation framework and a learnable-token multi-CFG mechanism in a teacher–student diffusion setup. The teacher is trained only on high-quality data, while the student is distilled on a broader set of data of varying quality, using an adaptively balanced mix of ground-truth and teacher supervision to improve robustness. FADA also replaces the three inference runs required by multi-CFG with learnable CFG tokens that let the student mimic the multi-CFG process, achieving NFE speedups of 4.17x to 12.5x while maintaining vivid, audio-synced videos across open-set and standard datasets.

Abstract

Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage http://fadavatar.github.io.
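To make the "threefold inference runs" concrete, here is a minimal sketch of one multi-CFG denoising step with a reference-image condition and an audio condition. The nested combination rule, the `Denoiser` signature, and `toy_denoiser` are assumptions for illustration; this summary does not state the exact formula FADA uses. The `nfe` helper only counts network evaluations, and the step counts in the usage note are hypothetical.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature: (noisy latent, reference cond, audio cond) -> predicted noise.
Denoiser = Callable[[Vector, Optional[Vector], Optional[Vector]], Vector]


def multi_cfg_step(
    denoise: Denoiser,
    x: Vector,
    ref: Vector,
    audio: Vector,
    ref_scale: float,
    audio_scale: float,
) -> Vector:
    """Combine three denoiser runs into one guided prediction.

    Assumed nested form: start from the unconditional output, push toward
    the reference-only output, then push further toward the fully
    conditioned output.
    """
    eps_uncond = denoise(x, None, None)  # run 1: no conditions
    eps_ref = denoise(x, ref, None)      # run 2: reference image only
    eps_full = denoise(x, ref, audio)    # run 3: reference image + audio
    return [
        u + ref_scale * (r - u) + audio_scale * (f - r)
        for u, r, f in zip(eps_uncond, eps_ref, eps_full)
    ]


def nfe(steps: int, runs_per_step: int) -> int:
    """Number of function (network) evaluations for one sampling pass."""
    return steps * runs_per_step


def toy_denoiser(x: Vector, ref: Optional[Vector], audio: Optional[Vector]) -> Vector:
    """Toy stand-in for a real network: adds whichever conditions are present."""
    out = list(x)
    if ref is not None:
        out = [o + r for o, r in zip(out, ref)]
    if audio is not None:
        out = [o + a for o, a in zip(out, audio)]
    return out
```

As a usage example with hypothetical numbers: a 20-step sampler with three runs per step costs `nfe(20, 3) == 60` evaluations, while a 5-step student that needs a single run per step costs `nfe(5, 1) == 5`, a 12x reduction. Removing the factor of three is what the learnable-token multi-CFG distillation targets; the remaining gain comes from using fewer steps.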

Paper Structure

This paper contains 21 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overall distillation framework of FADA. The teacher model is trained only with high-quality data, which is omitted in the figure. The student model is trained by a mixed loss of ground-truth and teacher-supervised loss to leverage data of varying quality. Learnable token-based CFG conditions enable the student model to mimic the multi-CFG process, further reducing inference times. For simplicity, we have omitted some components commonly used in previous methods.
  • Figure 2: Qualitative comparisons between FADA and baselines across different portraits and pronunciations in the open-set setting.
  • Figure 3: Line charts showing how the FVD and Sync-D metrics vary with the audio CFG scale on the HDTF test set. The reference CFG scale is set to 2.0 in this figure.
  • Figure 4: Line charts showing how the FVD and Sync-D metrics vary with the reference CFG scale on the HDTF test set. The audio CFG scale is set to 6.5 in this figure.
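The Figure 1 caption describes the student as trained with a mix of ground-truth and teacher-supervised losses. A minimal sketch of such a mixed loss follows. The convex combination, the mean-squared-error terms, and the fixed `gt_weight` parameter are assumptions for illustration; the summary mentions an adaptive balance but does not give its rule, so none is modeled here.

```python
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixed_supervised_loss(
    student_pred: List[float],
    ground_truth: List[float],
    teacher_pred: List[float],
    gt_weight: float,
) -> float:
    """Blend ground-truth and teacher supervision for one sample.

    gt_weight is a hypothetical knob in [0, 1]: 1.0 trusts the data label
    fully, 0.0 relies only on the teacher's output (useful when the sample
    itself is of lower quality).
    """
    if not 0.0 <= gt_weight <= 1.0:
        raise ValueError("gt_weight must lie in [0, 1]")
    loss_gt = mse(student_pred, ground_truth)
    loss_teacher = mse(student_pred, teacher_pred)
    return gt_weight * loss_gt + (1.0 - gt_weight) * loss_teacher
```

One plausible design choice under these assumptions is to lower `gt_weight` for lower-quality samples, so the teacher's prediction dominates where the ground truth is less reliable.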