VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan

TL;DR

VFace introduces a training-free approach to diffusion-based video face swapping by extending image-based diffusion models with three plug-and-play modules: Target Structure Guidance (TSG) to steer structural alignment, Frequency Spectrum Attention Interpolation (FSAI) to preserve identity in the attention space, and Flow-guided Attention Temporal Smoothening (FATS) to enforce temporal coherence via optical-flow-guided attention propagation. It leverages DDIM inversion to initialize source-target conditioning and achieves temporally stable, high-fidelity video swaps without model retraining. Extensive experiments on standard datasets demonstrate improved identity preservation, pose/expression fidelity, and temporal consistency (CD-FVD/FVD), outperforming baselines and existing video diffusion pipelines. The method offers a practical, modular solution for video-based face swapping with one-shot capabilities and broad compatibility with diffusion-based image swaps, highlighting significant potential for real-time or production-scale applications while acknowledging remaining flicker and occlusion limitations.

Abstract

We present VFace, a training-free, plug-and-play method for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
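
To make the idea of blending features in the frequency domain more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It takes the low spatial frequencies from one feature map and the high frequencies from another. The function names (`radial_low_pass_mask`, `frequency_blend`), the `(C, H, W)` layout, the hard radial cutoff, and the choice of which input supplies which band are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def radial_low_pass_mask(height, width, cutoff):
    """Boolean (H, W) mask, True where the spatial-frequency radius is <= cutoff.

    Frequencies are in cycles per sample, as returned by np.fft.fftfreq.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) <= cutoff


def frequency_blend(low_src, high_src, cutoff=0.1):
    """Hypothetical helper: low frequencies from low_src, the rest from high_src.

    Both inputs are real arrays of shape (C, H, W), e.g. attention features
    reshaped into square maps. The cutoff value is an illustrative assumption.
    """
    if low_src.shape != high_src.shape:
        raise ValueError("inputs must have the same shape")
    _, height, width = low_src.shape
    mask = radial_low_pass_mask(height, width, cutoff)

    # 2D FFT over the spatial axes of every channel.
    low_spec = np.fft.fft2(low_src, axes=(-2, -1))
    high_spec = np.fft.fft2(high_src, axes=(-2, -1))

    # Keep the low band of one spectrum and the high band of the other.
    blended = np.where(mask, low_spec, high_spec)

    # The mask is symmetric in frequency, so the result is real up to
    # floating-point error; drop the negligible imaginary part.
    return np.fft.ifft2(blended, axes=(-2, -1)).real
```

A hard mask is the simplest choice for a sketch; a smooth transition between the two bands (for example, a Gaussian weighting) would reduce ringing, at the cost of a less sharply defined split.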

Paper Structure (15 sections, 7 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Our method (VFace) for video face swapping effectively guides the structure of the target video (T) while preserving the identity features from the source image (S). Video animation can be viewed using Acrobat Reader (Click to play).
  • Figure 2: VFace overview with its key modules for video face swapping. The pipeline consists of three core components: (1) Target Structure Guidance (TSG), which aligns structural features from the target video to guide generation; (2) Frequency Spectrum Attention Interpolation (FSAI), which performs attention feature blending in the frequency domain to decouple identity and structure cues; and (3) Flow-guided Attention Temporal Smoothening (FATS), which ensures temporal coherence by propagating attention features across frames using optical flow. Together, these modules enable identity-preserving, structure-aware, and temporally consistent face-swapping.
  • Figure 3: Low pass and High pass filtered source images.
  • Figure 4: Visualization of the first five channels of the $q$ vector from our FSAI module at the 40th DDIM step, reshaped into square maps.
  • Figure 5: Video comparison with REFace as the baseline on the VFHQ dataset. Play the video with Adobe Reader.
  • ...and 3 more figures