Replace Anyone in Videos

Xiang Wang, Shiwei Zhang, Haonan Qiu, Ruihang Chu, Zekun Li, Yingya Zhang, Changxin Gao, Yuehuan Wang, Chunhua Shen, Nong Sang

TL;DR

ReplaceAnyone addresses the problem of localized character replacement and insertion in realistic videos with dynamic backgrounds by unifying image-conditioned pose-driven video generation and masked inpainting within a diffusion-based framework. It introduces diverse mask forms, an enriched visual guidance system, a hybrid inpainting encoder, and a two-phase training strategy to prevent shape leakage, preserve background detail, and simplify optimization. Empirical results on TikTok and UBC Fashion demonstrate superior fidelity, temporal coherence, and background preservation compared with state-of-the-art baselines and image-to-video methods, and the framework generalizes to DiT-based Wan2.1. The work advances practical controllable video synthesis, enabling seamless identity transfer and pose-consistent insertions in complex scenes with potential applications in film, VR, and virtual production, while noting limitations in mask accuracy and facial/finger detail fidelity for future improvement.

Abstract

The field of controllable human-centric video generation has witnessed remarkable progress, particularly with the advent of diffusion models. However, achieving precise and localized control over human motion in videos, such as replacing or inserting individuals while preserving desired motion patterns, remains a formidable challenge. In this work, we present the ReplaceAnyone framework, which focuses on localized human replacement and insertion in videos with intricate backgrounds. Specifically, we formulate this task as an image-conditioned video inpainting paradigm with pose guidance, using a unified end-to-end video diffusion architecture that performs image-conditioned video inpainting within masked regions. To prevent shape leakage and enable granular local control, we introduce diverse mask forms involving both regular and irregular shapes. Furthermore, we implement an enriched visual guidance mechanism to enhance appearance alignment, a hybrid inpainting encoder to better preserve the detailed background information in the masked video, and a two-phase optimization methodology to reduce training difficulty. ReplaceAnyone enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Extensive experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content. The proposed ReplaceAnyone can be applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1. The code will be available at https://github.com/ali-vilab/UniAnimate-DiT.
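
The "diverse mask forms" idea from the abstract can be illustrated with a small sketch. The abstract only says that regular and irregular shapes are used; the specific forms below (a padded bounding box and a dilated silhouette), the function names, and the sampling probabilities are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def bounding_box_mask(mask: np.ndarray, pad: int = 0) -> np.ndarray:
    """Regular form (assumed): the character's bounding box, optionally padded."""
    ys, xs = np.nonzero(mask)
    out = np.zeros_like(mask)
    if ys.size == 0:
        return out  # empty mask stays empty
    h, w = mask.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad + 1, h)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad + 1, w)
    out[y0:y1, x0:x1] = 1
    return out

def dilated_mask(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Irregular form (assumed): grow the silhouette by `radius` pixels
    using a square structuring element."""
    h, w = mask.shape
    padded = np.pad(mask, radius)  # zero padding on every side
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def sample_training_mask(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Pick one mask form at random so the exact silhouette is not always visible."""
    choice = int(rng.integers(3))
    if choice == 0:
        return mask.copy()  # original character mask
    if choice == 1:
        return bounding_box_mask(mask, pad=int(rng.integers(0, 4)))
    return dilated_mask(mask, radius=int(rng.integers(1, 4)))

# Toy L-shaped "character" on an 8x8 frame.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3] = 1
mask[4, 3:6] = 1
```

Every sampled form covers the original silhouette, so the region to regenerate always contains the character while its exact outline varies from sample to sample.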

Paper Structure (17 sections, 4 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Video demo examples synthesized by the proposed ReplaceAnyone. Our ReplaceAnyone enables character replacement or insertion in a source video with dynamic backgrounds using a reference image, preserving both the desired pose motion and reference appearance.
  • Figure 2: Overall framework of ReplaceAnyone. We use a unified video diffusion model to perform image-conditioned pose-driven video generation and video inpainting simultaneously. To encode the reference image comprehensively, we design an enriched visual guidance mechanism that extracts mask, pose, and segmented image features separately. Moreover, a variety of mask forms are designed to prevent leakage of segmentation shape information and to facilitate fine-grained control. To preserve the details in the masked video, we design a hybrid inpainting encoder, which consists of a learnable inpainting encoder and a VAE encoder. The masked encoder, pose encoder, and inpainting encoder have similar structures, each consisting of several learnable downsampling convolutional layers to reduce computational complexity.
  • Figure 3: Illustration of shape leakage. If only the original character mask is used for training, the network will overfit to the information of the masked shape, resulting in obvious discordant parts during inference, such as "four hands" or unrealistic padding. This problem can be significantly alleviated by introducing diverse mask forms.
  • Figure 4: Importance of incorporating reference mask. If the reference mask is not introduced, the model may mistakenly regard the black background around the reference character as part of the character, resulting in unrealistic generation.
  • Figure 5: Qualitative comparison between the VAE encoder and our hybrid inpainting encoder. The proposed hybrid inpainting encoder combines the strong semantic encoding capability of the VAE encoder with the detail preservation capability of the learnable inpainting encoder to better preserve background information in the masked video frames.
  • ...and 10 more figures
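
The captions for Figures 2 and 4 refer to a masked video, a segmented reference image, and a reference mask. A minimal sketch of how such inputs could be prepared is given below; the array layouts, value ranges, and function names are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np

def make_masked_video(frames: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Zero out the region to be regenerated in every frame.

    frames: (T, H, W, 3) float array; masks: (T, H, W) binary array where
    1 marks the character region to inpaint (assumed convention).
    """
    keep = 1.0 - masks[..., None].astype(frames.dtype)
    return frames * keep

def segment_reference(image: np.ndarray, ref_mask: np.ndarray) -> np.ndarray:
    """Keep only the reference character; everything else becomes black.

    Black pixels are ambiguous on their own (background or dark clothing),
    which is why the reference mask is kept as a separate input.
    """
    return image * ref_mask[..., None].astype(image.dtype)

# Toy data: 2 frames of 4x4 white video with a 2x2 character region.
frames = np.ones((2, 4, 4, 3), dtype=np.float32)
masks = np.zeros((2, 4, 4), dtype=np.uint8)
masks[:, 1:3, 1:3] = 1
masked_video = make_masked_video(frames, masks)

# Toy reference: the character occupies the top half of a 4x4 image.
reference = np.ones((4, 4, 3), dtype=np.float32)
ref_mask = np.zeros((4, 4), dtype=np.uint8)
ref_mask[0:2, :] = 1
segmented = segment_reference(reference, ref_mask)
```

In this sketch the masked video keeps all background pixels untouched, and the segmented reference is paired with its mask so that downstream encoders can tell the character apart from the blacked-out surroundings.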