VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan
TL;DR
VFace is a training-free approach to diffusion-based video face swapping that extends image-based diffusion models with three plug-and-play modules: Target Structure Guidance (TSG), which steers structural alignment with the target frame; Frequency Spectrum Attention Interpolation (FSAI), which preserves identity in the attention space; and Flow-guided Attention Temporal Smoothening (FATS), which enforces temporal coherence through optical-flow-guided attention propagation. It uses DDIM inversion to initialize source-target conditioning and produces temporally stable, high-fidelity video swaps without retraining the model. Experiments on standard datasets show improved identity preservation, pose and expression fidelity, and temporal consistency (CD-FVD/FVD), outperforming baselines and existing video diffusion pipelines. The method is a practical, modular solution for video face swapping with one-shot capability and broad compatibility with diffusion-based image swapping models. It shows potential for real-time or production-scale applications, though some flicker and occlusion limitations remain.
Abstract
We present VFace, a training-free, plug-and-play method for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features of the target frame with the generated output. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
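To make the idea of interpolating in the frequency spectrum concrete, the following is a minimal NumPy sketch of blending two 2-D maps in the Fourier domain. It is not the VFace implementation: the function name `frequency_interpolate`, the `cutoff` parameter, the hard circular mask, and the choice of which input supplies the low band are all illustrative assumptions, and the abstract does not specify how the attention features are actually combined.

```python
import numpy as np


def frequency_interpolate(feat_a, feat_b, cutoff=0.25):
    """Blend two 2-D maps: low frequencies from feat_a, high from feat_b.

    Hypothetical helper for illustration only. The band assignment and the
    hard circular mask are assumptions, not details taken from the paper.
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("inputs must have the same shape")
    h, w = feat_a.shape

    # Centered 2-D spectra of both maps.
    spec_a = np.fft.fftshift(np.fft.fft2(feat_a))
    spec_b = np.fft.fftshift(np.fft.fft2(feat_b))

    # Frequency coordinates in cycles per sample, centered to match fftshift.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]

    # 1 inside the low-frequency disc of radius `cutoff`, 0 outside.
    low_pass = (np.sqrt(fx**2 + fy**2) <= cutoff).astype(float)

    # Take the low band from feat_a and the remaining band from feat_b.
    mixed = low_pass * spec_a + (1.0 - low_pass) * spec_b

    # Real inputs and a symmetric mask give a real result up to rounding.
    return np.fft.ifft2(np.fft.ifftshift(mixed)).real


# Usage on random stand-ins for two same-sized attention or feature maps.
rng = np.random.default_rng(0)
map_a = rng.standard_normal((8, 8))
map_b = rng.standard_normal((8, 8))
blended = frequency_interpolate(map_a, map_b, cutoff=0.25)
```

In this sketch, a `cutoff` large enough to cover the whole spectrum returns `feat_a` unchanged, and a negative `cutoff` returns `feat_b`, so the parameter moves the result between the two inputs.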
