SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video

Chengshu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan

TL;DR

SwapAnyone treats video body-swapping as an independent task, optimized end to end and constrained by three consistencies: identity, motion, and environment. Its architecture combines an Inpainting UNet, Temporal Layers, an ID Extraction Module, and a Motion Control Module, guided by a CLIP image encoder and trained progressively with the EnvHarmony strategy, which targets environmental harmony and luminance harmony in particular. A new dataset, HumanAction-32K, covers a variety of human-action videos. Experiments show state-of-the-art performance among open-source methods and results that approach or surpass closed-source models across multiple dimensions, with consistent identity, accurate motion, and seamless blending of the body with the background. The work enables editing of existing videos with a reference body while preserving environmental and luminance harmony.

Abstract

Video body-swapping aims to replace the body in an existing video with a new body from an arbitrary source, a task that has attracted growing attention in recent years. Existing methods treat video body-swapping as a composite of multiple tasks rather than an independent task, and typically chain several models together to perform the swap sequentially. As a result, these methods cannot be optimized end to end for video body-swapping, which causes issues such as luminance variation across frames, disorganized occlusion relationships, and a noticeable separation between the body and the background. In this work, we define video body-swapping as an independent task and propose three critical consistencies: identity consistency, motion consistency, and environment consistency. We introduce an end-to-end model named SwapAnyone, which treats video body-swapping as a video inpainting task with reference fidelity and motion control. To improve the model's ability to maintain environmental harmony, particularly luminance harmony, in the resulting video, we introduce a novel EnvHarmony strategy that trains the model progressively. Additionally, we provide a dataset named HumanAction-32K covering a variety of human-action videos. Extensive experiments demonstrate that our method achieves State-Of-The-Art (SOTA) performance among open-source methods while approaching or surpassing closed-source models across multiple dimensions. All code, model weights, and the HumanAction-32K dataset will be open-sourced at https://github.com/PKU-YuanGroup/SwapAnyone.

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: SwapAnyone allows users to provide a reference body image and a target video from any source, then seamlessly swap the provided body with the original body in the target video to produce a highly realistic video.
  • Figure 2: Overview of SwapAnyone. First, the user-provided reference body image and its corresponding DWpose image are processed by the ID Extraction Module. Simultaneously, the DWpose sequence of the body in the target video is sent to the Motion Control Module to extract motion features, which are incorporated into the latents. The latents are then passed into the Inpainting UNet, which integrates features from the ID Extraction Module via self-attention. Meanwhile, the reference body image is processed by a CLIP image encoder to extract features, enabling semantic integration via cross-attention in both the ID Extraction Module and the Inpainting UNet. After denoising, the model outputs a video in which the body in the target video is replaced with the reference body.
  • Figure 3: Comparison of visual quality across different methods. Viggle AI effectively preserves the identity of the reference body, but a noticeable boundary remains between the body and the background. Additionally, it struggles to handle occlusions between the body and objects in the background. The Inpainting model with MimicMotion struggles with background fidelity. The Inpainting model with IP Adapter and ControlNet lacks temporal modeling ability, leading to identity variations across frames. Our SwapAnyone maintains consistency of identity with the reference across frames while seamlessly blending the body with the background.
  • Figure 4: Ablation study visual comparison. The EnvHarmony strategy with data augmentation and MSE loss produces the most refined results. Additionally, the 9-channel Inpainting UNet achieves better background fidelity.
  • Figure 5: User study results of the four methods, where the Human Preference Percentage represents the proportion of best-performance evaluations each method receives in each dimension. SwapAnyone achieves performance comparable to the closed-source Viggle AI while surpassing other open-source models across all dimensions.
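
The Human Preference Percentage described in the Figure 5 caption can be illustrated with a small calculation. The following is a minimal sketch, not the authors' evaluation code: it assumes each participant casts one "best" vote per dimension, and the function name, dimension names, method names, and vote data are all hypothetical.

```python
from collections import Counter, defaultdict


def human_preference_percentage(votes):
    """Return, per dimension, the share of 'best' votes each method received.

    `votes` is an iterable of (dimension, method) pairs, one pair per
    'best-performance' evaluation. This input format is an assumption
    made for illustration; the source does not specify it.
    """
    counts = defaultdict(Counter)
    for dimension, method in votes:
        counts[dimension][method] += 1

    result = {}
    for dimension, counter in counts.items():
        total = sum(counter.values())
        # Convert raw vote counts into percentages of all votes in this dimension.
        result[dimension] = {m: 100.0 * c / total for m, c in counter.items()}
    return result


# Hypothetical votes, not data from the paper.
votes = [
    ("identity", "method_a"),
    ("identity", "method_a"),
    ("identity", "method_b"),
    ("identity", "method_a"),
    ("motion", "method_b"),
    ("motion", "method_a"),
]
percentages = human_preference_percentage(votes)
print(percentages)
```

In this sketch the percentages within a dimension sum to 100, and a method that receives no votes in a dimension is simply absent from that dimension's result.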