Superior and Pragmatic Talking Face Generation with Teacher-Student Framework
Chao Liang, Jianwen Jiang, Tianyun Zhong, Gaojie Lin, Zhengkun Rong, Jiaqi Yang, Yongming Zhu
TL;DR
This paper tackles practical talking head generation by addressing quality, robustness to degraded inputs, computational efficiency, and editable control in a unified framework. It introduces SuperFace, a teacher-student system in which a high-capacity teacher, equipped with Simulation for Super-Resolution (SSR) and a Motion-Enhancing Mechanism (MEM), learns 3D-aware motion and robust synthesis, then distills its knowledge into an identity-specific lightweight student that requires about 99% fewer FLOPs. The method also adds a Mask Training Mechanism (MTM) for decoupled local editing and an audio-to-lip module for cross-modal control, enabling flexible, real-time editing across modalities. Empirical results show that the teacher surpasses state-of-the-art baselines in both video-driven and audio-driven settings, while the student achieves comparable performance with two orders of magnitude less computation and strong identity generalization. Overall, SuperFace delivers a practical, high-quality solution for real-world talking head generation, with editable, cross-modal capabilities and potential for efficient deployment.
Abstract
Talking face generation technology creates talking videos from arbitrary appearance and motion signals. This arbitrariness offers ease of use but also introduces challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with complex real-world inputs. Efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost, and editability. We first propose a simple but effective teacher model capable of handling inputs of varying quality to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with a significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for these four objectives, notably reducing FLOPs by 99% with the student model. SuperFace can be driven by both video and audio and allows for localized facial attribute editing.
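As a quick arithmetic check, a 99% reduction in FLOPs means the student uses 1% of the teacher's compute, which is a 100x reduction, i.e. two orders of magnitude. The sketch below only illustrates that conversion; the function name and the FLOP counts are hypothetical placeholders, not figures reported for SuperFace.

```python
import math


def flops_reduction(teacher_flops, student_flops):
    """Return (fractional reduction, reduction factor) between two FLOP counts."""
    if teacher_flops <= 0 or student_flops <= 0:
        raise ValueError("FLOP counts must be positive")
    reduction = 1.0 - student_flops / teacher_flops  # fraction of compute removed
    factor = teacher_flops / student_flops           # how many times cheaper
    return reduction, factor


# Hypothetical counts chosen only to illustrate a 99% reduction.
teacher = 100e9  # assumed teacher cost per frame, in FLOPs
student = 1e9    # assumed student cost per frame, in FLOPs

reduction, factor = flops_reduction(teacher, student)
orders_of_magnitude = math.log10(factor)

print(f"reduction: {reduction:.0%}, factor: {factor:.0f}x, "
      f"orders of magnitude: {orders_of_magnitude:.1f}")
```

With other FLOP counts the same function gives the corresponding reduction and factor.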
