SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Zeren Zhang, Haibo Qin, Jiayu Huang, Yixin Li, Hui Lin, Yitao Duan, Jinwen Ma

TL;DR

SwapTalk addresses the interference and fidelity problems that arise when face swapping is combined with lip synchronization by performing both tasks in a shared, high-fidelity VQ-embedding latent space derived from a pre-trained VQGAN. The framework trains a Transformer-based face-swapping module and a latent-space lip-sync module independently, supervising them with identity losses and a lip-sync expert respectively, and at inference applies face swapping before lip synchronization within the latent space. Key contributions include using the VQ-embedding space to reduce computation, improving generalization to unseen identities, supervising lip synchronization directly in the latent space, and proposing a novel identity-consistency metric for video sequences. Experiments on the HDTF dataset, in both self-driven and cross-driven settings, show that SwapTalk outperforms cascaded baselines and end-to-end models in video quality, lip synchronization accuracy, face-swapping fidelity, and identity consistency, and that additional training data further improves performance. The work targets real-world customized talking-face generation, offering an efficient pipeline together with a broader evaluation protocol.

Abstract

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models tends to introduce significant interference between the tasks and reduce video clarity, because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization in the same latent space. Following recent work on face generation, we choose the VQ-embedding space for its excellent editability and fidelity. To enhance the framework's generalization to unseen identities, we incorporate an identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space during the training of the lip synchronization module to improve synchronization quality. In the evaluation phase, previous studies primarily focused on the self-reconstruction of lip movements in synchronous audio-visual videos. To better approximate real-world applications, we expand the evaluation scope to asynchronous audio-video scenarios. Furthermore, we introduce a novel identity consistency metric to assess more comprehensively how well identity is preserved over time in generated facial videos. Experimental results on the HDTF dataset demonstrate that our method significantly surpasses existing techniques in video quality, lip synchronization accuracy, face swapping fidelity, and identity consistency. Our demo is available at http://swaptalk.cc.

Paper Structure (25 sections, 17 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Our model is capable of transferring the facial region of a user-defined personalized avatar (source ID) onto a specified target template, while also accommodating lip shape deformations to ensure that the lip movements in the generated video are synchronized with the user-specified audio content.
  • Figure 2: (a) Details of the encoding process of the VQ Encoder. (b) The overall framework of our proposed method. The facial image is first encoded into the VQ-embedding space. Then, the face swapping module (c) and the lip-sync module (d) handle face swapping and lip synchronization, respectively. Finally, the VQ Decoder converts the output back into RGB space, producing a customized talking face video.
  • Figure 3: Our proposed method is compared with Wav2Lip under two settings: self-driven and cross-driven.
  • Figure 4: Visual comparison of various cascaded methods with our proposed method in the cross-driven scenario.
  • Figure 5: The impact of different VQ space compression ratios on the quality of generated videos.
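
The data flow described in the Figure 2 caption (encode into the VQ-embedding space, swap the face, synchronize the lips, then decode once back to RGB) can be illustrated with a small sketch. This is not the authors' implementation: the function names `vq_quantize` and `generate_frame`, the toy codebook, and the pass-through stand-ins for the swapping, lip-sync, and decoder modules are all assumptions made for illustration. The only standard technique shown is the nearest-neighbour codebook lookup commonly used in VQ-style models; the features passed in are assumed to have already been produced by an encoder network.

```python
from typing import Callable, Tuple

import numpy as np

LatentFn = Callable[[np.ndarray], np.ndarray]


def vq_quantize(features: np.ndarray, codebook: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Snap each feature vector (N, D) to its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every feature and every code: shape (N, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    indices = dist2.argmin(axis=1)
    return codebook[indices], indices


def generate_frame(
    features: np.ndarray,
    codebook: np.ndarray,
    swap_fn: LatentFn,      # hypothetical stand-in for the face swapping module
    lipsync_fn: LatentFn,   # hypothetical stand-in for the lip-sync module
    decode_fn: LatentFn,    # hypothetical stand-in for the VQ Decoder
) -> np.ndarray:
    """Run both edits in the latent space and decode to RGB only once."""
    latent, _ = vq_quantize(features, codebook)
    latent = swap_fn(latent)       # face swapping first ...
    latent = lipsync_fn(latent)    # ... then lip synchronization
    return decode_fn(latent)


# Toy data: three 2-D "encoder features" and a three-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]])
quantized, indices = vq_quantize(features, codebook)

# Pass-through modules that only record the order in which they are called.
call_order = []


def toy_swap(z: np.ndarray) -> np.ndarray:
    call_order.append("swap")
    return z


def toy_lipsync(z: np.ndarray) -> np.ndarray:
    call_order.append("lipsync")
    return z


def toy_decode(z: np.ndarray) -> np.ndarray:
    call_order.append("decode")
    return z


frame = generate_frame(features, codebook, toy_swap, toy_lipsync, toy_decode)
```

Passing the modules in as callables keeps the sketch independent of any particular network architecture. The point it makes is structural: both edits operate on the same latent tensor, so no intermediate RGB image is produced between the two tasks.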