Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen

TL;DR

LivingSwap targets high-fidelity, temporally coherent video face swapping for cinematic production. It introduces a video reference-guided framework that uses keyframes for stable identity injection and a reference video completion module to preserve non-identity attributes. The approach is supported by Face2Face, a paired training dataset created via a role-reversing strategy, and by CineFaceBench, a benchmark for film-like scenarios. Experiments on FF++ and CineFaceBench show state-of-the-art performance, with strong identity preservation and realistic lighting, expressions, and motion. Chunk-based temporal stitching enables robust long-sequence generation and significantly reduces manual editing effort. The method is also robust to imperfect keyframes, and grayscale keyframe guidance improves color stability across sequences.

Abstract

Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference-guided face-swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
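
The chunk-by-chunk generation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `generate_chunk` is a hypothetical stand-in for the model, the default `chunk_size` of 16 is an arbitrary placeholder, and passing keyframes as a frame-index mapping is an assumption. The sketch only shows the control flow: each chunk receives the matching source frames as a reference, any keyframes that fall inside it, and the last generated frame of the previous chunk as guidance.

```python
from typing import Callable, Dict, List, Mapping, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")

# Hypothetical model interface: (reference frames, keyframes local to the
# chunk, last generated frame of the previous chunk) -> generated frames.
ChunkGenerator = Callable[
    [Sequence[Frame], Dict[int, Frame], Optional[Frame]], List[Frame]
]


def swap_long_video(
    source_frames: Sequence[Frame],
    keyframes: Mapping[int, Frame],
    generate_chunk: ChunkGenerator,
    chunk_size: int = 16,  # placeholder value, not taken from the paper
) -> List[Frame]:
    """Generate a long video chunk by chunk, carrying the last frame forward."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    output: List[Frame] = []
    previous_last: Optional[Frame] = None  # no guidance for the first chunk

    for start in range(0, len(source_frames), chunk_size):
        end = min(start + chunk_size, len(source_frames))
        reference = source_frames[start:end]

        # Re-index global keyframe positions relative to this chunk.
        local_keyframes = {
            index - start: frame
            for index, frame in keyframes.items()
            if start <= index < end
        }

        chunk = generate_chunk(reference, local_keyframes, previous_last)
        if len(chunk) != len(reference):
            raise ValueError("generator must return one frame per reference frame")

        output.extend(chunk)
        previous_last = chunk[-1]  # guidance for the next chunk

    return output
```

Because the model is passed in as a plain callable, the loop can be exercised with a dummy generator before a real model is plugged in.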

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: (a) GAN-based approaches process videos in a frame-by-frame manner, and therefore often struggle with realism and suffer from temporal inconsistency. (b) Inpainting-based methods focus on generating the facial region based on sparse conditions, which inevitably leads to a loss of fidelity and unnatural visual artifacts. (c) Recent reference-based generation methods enable faithful utilization of rich visual attributes contained in references and demonstrate remarkable capability in preserving them.
  • Figure 2: Overview of the proposed LivingSwap framework for video face swapping. (1) Keyframes are used as temporal anchors to ensure consistent identity injection across long sequences. (2) We feed the source video as a reference, enabling high-fidelity reconstruction of non-identity attributes such as lighting and expressions. (3) By sequentially generating chunks and propagating the final frame of the previous chunk as guidance, LivingSwap achieves seamless transitions in long videos. (4) We use a Per-frame Edit method to generate the data and reverse the roles within each data pair to construct paired samples, ensuring reliable and artifact-free learning.
  • Figure 3: Qualitative comparison with state-of-the-art face-swapping methods. LivingSwap achieves the best overall performance, outperforming both GAN-based and diffusion-based approaches in video consistency, visual fidelity, and identity similarity. Although our keyframes are generated using Inswapper, the final results produced by LivingSwap are more stable and better preserve source attributes, even in challenging scenarios such as side profiles, occlusions, facial makeup, and complex lighting.
  • Figure 4: Visualization of the Face2Face dataset. The central plot shows the distribution of identity similarity scores between each swapped video and its corresponding original video, with the lowest 30% (red) and highest 30% (blue) highlighted. Low-similarity pairs often contain artifacts and distortions that show up as significant identity discrepancies (left), while high-similarity pairs may contain failed swap frames, causing identity inconsistencies and flickering (right).
  • Figure 5: Qualitative comparison between the data pairs in Face2Face (generated by Inswapper [facefusion2025]) and the corresponding results generated by LivingSwap. Benefiting from the reversed roles within each data pair and the strong priors of the pretrained model, LivingSwap surpasses the quality of its training data, achieving better expression consistency and overall realism. Unlike Inswapper-based results, our method avoids local failure cases such as incomplete swaps, mismatched regions, and occlusion-induced artifacts, demonstrating its strong generalization beyond the training dataset.
  • ...and 8 more figures
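
Two dataset ideas from the captions above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' pipeline. `TrainingPair`, `reverse_roles`, and `similarity_tails` are hypothetical names. Reading "role reversal" as "the swapped video becomes the model input and the original video becomes the ground truth" is an interpretation of the Figure 2 caption and the abstract. The 30% default mirrors the tails highlighted in Figure 4.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingPair:
    """Hypothetical container for one paired training sample."""
    model_input: str   # video given to the model as its reference input
    ground_truth: str  # video the model is supervised to reproduce


def reverse_roles(original_video: str, swapped_video: str) -> TrainingPair:
    """Assumed reading of role reversal: train on the (possibly imperfect)
    swapped video as input and the untouched original as the target."""
    return TrainingPair(model_input=swapped_video, ground_truth=original_video)


def similarity_tails(
    scores: Sequence[float], percent: int = 30
) -> Tuple[List[int], List[int]]:
    """Return indices of the lowest and highest `percent` of scores."""
    if not 0 <= percent <= 50:
        raise ValueError("percent must be between 0 and 50")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    count = len(scores) * percent // 100  # integer math avoids float rounding
    lowest = order[:count]
    highest = order[len(order) - count:]
    return lowest, highest
```

For example, `similarity_tails(scores)` on a list of per-pair identity similarity scores returns the index sets corresponding to the red and blue regions of the Figure 4 distribution, which can then be inspected for the failure modes the caption describes.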