Superior and Pragmatic Talking Face Generation with Teacher-Student Framework

Chao Liang, Jianwen Jiang, Tianyun Zhong, Gaojie Lin, Zhengkun Rong, Jiaqi Yang, Yongming Zhu

TL;DR

This paper tackles practical talking head generation by addressing quality, robustness to degraded inputs, computational efficiency, and editable control in a unified framework. It introduces SuperFace, a teacher-student system where a high-capacity teacher employing Simulation for Super-Resolution (SSR) and a Motion-Enhancing Mechanism (MEM) learns 3D-aware motion and robust synthesis, then distills its knowledge into an identity-specific lightweight student that runs with dramatically reduced FLOPs. The method also adds a Mask Training Mechanism (MTM) for decoupled local editing and an audio-to-lip module for cross-modal control, enabling flexible, real-time editing across modalities. Empirical results show the teacher surpasses state-of-the-art baselines in video- and audio-driven settings, while the student achieves comparable performance with two orders of magnitude lower computation and strong identity generalization. Overall, SuperFace delivers a practical, high-quality solution for real-world talking head generation with editable, cross-modal capabilities and efficient deployment potential.

Abstract

Talking face generation creates talking videos from arbitrary appearance and motion signals. This arbitrariness makes the technology easy to use but also introduces challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with intricate real-world ones. Efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost and editability. We first propose a simple but effective teacher model capable of handling inputs of varying quality to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four objectives above, most notably by reducing FLOPs by 99% with the student model. SuperFace can be driven by both video and audio and allows for localized editing of facial attributes.
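
As a quick sanity check, the 99% FLOPs reduction quoted above corresponds to the "two orders of magnitude" stated in the TL;DR. The symbols F_T and F_S (teacher and student FLOPs) are illustrative and not notation from the paper, and the sketch assumes the reduction is measured relative to the teacher:

```latex
\frac{F_T - F_S}{F_T} = 0.99
\;\Longrightarrow\; F_S = 0.01\,F_T
\;\Longrightarrow\; \frac{F_T}{F_S} = 100 = 10^{2}
```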

Paper Structure (18 sections, 2 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Our proposed method outperforms state-of-the-art ones in the following aspects: 1) Quality: producing higher-quality results (top left); 2) Robustness: maintaining robustness even with poor-quality input (top right); 3) Editability: enabling users to freely edit facial attributes (bottom left); 4) Low cost: achieving comparable results by distillation with a 99% reduction in FLOPs (bottom right).
  • Figure 2: The pipeline of our proposed framework. SuperFace consists of a teacher model and a student model. a) We first train an extremely powerful teacher model using MEM (motion-enhancing mechanism) and SSR (simulation for super-resolution). b) Then we distill its knowledge into an efficient student through feature delivery.
  • Figure 3: Our proposed MEM incorporates 2D inputs and 3D priors to accurately depict the motion. 3D priors are fully utilized through early/late infusion and two distinct representation designs.
  • Figure 4: Pipeline of our SSR. We introduce a second-order degradation. Each order consists of random blurring, resizing, adding noise, and JPEG compression.
  • Figure 5: Pipeline of our MTM. We mask the vanilla driving signals and replace them with other ones. This facilitates local editing and cross-modal driving.
  • ...and 4 more figures
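
The second-order degradation described in the Figure 4 caption (blur, resize, noise, JPEG compression, applied twice) can be sketched as follows. This is a minimal illustration and not the authors' SSR implementation. The kernel sizes, scale range, noise levels, and quantization steps are assumed values, and the image is assumed to be a single-channel array in [0, 255]. To keep the example dependency-free, JPEG compression is replaced by a crude uniform quantization stand-in, and all function names are hypothetical.

```python
import numpy as np


def box_blur(img, k):
    """Blur with a k x k box kernel (k odd), using edge padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of a 2D array to (new_h, new_w)."""
    h, w = img.shape
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[np.ix_(rows, cols)]


def coarse_quantize(img, step):
    """Stand-in for JPEG compression (assumption): uniform quantization."""
    return np.round(img / step) * step


def degrade_once(img, rng):
    """One degradation order: blur -> resize -> noise -> compression stand-in."""
    img = box_blur(img, k=int(rng.choice([3, 5])))
    scale = rng.uniform(0.5, 1.0)  # assumed downscaling range
    h, w = img.shape
    img = resize_nearest(img, max(1, round(h * scale)), max(1, round(w * scale)))
    sigma = rng.uniform(1.0, 10.0)  # assumed Gaussian noise level
    img = img + rng.normal(0.0, sigma, size=img.shape)
    img = coarse_quantize(img, step=rng.uniform(4.0, 16.0))
    return np.clip(img, 0.0, 255.0)


def second_order_degrade(img, seed=0):
    """Apply the degradation twice, then resize back to the input size."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    out = img.astype(np.float64)
    for _ in range(2):
        out = degrade_once(out, rng)
    return resize_nearest(out, h, w)


if __name__ == "__main__":
    clean = (np.arange(64 * 64).reshape(64, 64) % 256).astype(np.float64)
    degraded = second_order_degrade(clean, seed=0)
    print(degraded.shape, float(degraded.min()), float(degraded.max()))
```

Resizing the result back to the input size yields (degraded, clean) pairs of equal shape, which is one common way to build training pairs for a restoration-style objective; whether SSR pairs its data this way is not stated in the caption.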