Replace Anyone in Videos
Xiang Wang, Shiwei Zhang, Haonan Qiu, Ruihang Chu, Zekun Li, Yingya Zhang, Changxin Gao, Yuehuan Wang, Chunhua Shen, Nong Sang
TL;DR
ReplaceAnyone addresses localized character replacement and insertion in realistic videos with dynamic backgrounds by unifying image-conditioned pose-driven video generation and masked inpainting within a diffusion-based framework. It introduces diverse mask forms, an enriched visual guidance system, a hybrid inpainting encoder, and a two-phase training strategy to prevent shape leakage, preserve background detail, and simplify optimization. Empirical results on TikTok and UBC Fashion demonstrate superior fidelity, temporal coherence, and background preservation compared with state-of-the-art baselines and image-to-video methods, and the framework generalizes to DiT-based Wan2.1. The work advances practical controllable video synthesis, enabling seamless identity transfer and pose-consistent insertions in complex scenes, with potential applications in film, VR, and virtual production. The authors note limitations in mask accuracy and in facial and finger detail fidelity as directions for future improvement.
Abstract
The field of controllable human-centric video generation has witnessed remarkable progress, particularly with the advent of diffusion models. However, achieving precise and localized control over human motion in videos, such as replacing or inserting individuals while preserving desired motion patterns, remains a formidable challenge. In this work, we present the ReplaceAnyone framework, which focuses on localized human replacement and insertion in videos with intricate backgrounds. Specifically, we formulate this task as an image-conditioned video inpainting paradigm with pose guidance, utilizing a unified end-to-end video diffusion architecture that performs image-conditioned video inpainting within masked regions. To prevent shape leakage and enable granular local control, we introduce diverse mask forms involving both regular and irregular shapes. Furthermore, we implement an enriched visual guidance mechanism to enhance appearance alignment, a hybrid inpainting encoder to further preserve the detailed background information in the masked video, and a two-phase optimization methodology to reduce training difficulty. ReplaceAnyone enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Extensive experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content. The proposed ReplaceAnyone can be seamlessly applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1. The code will be available at https://github.com/ali-vilab/UniAnimate-DiT.
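To make the idea of regular and irregular mask forms concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a binary person silhouette per frame and shows one plausible regular mask (a padded bounding box), one plausible irregular mask (the silhouette dilated with a square neighborhood), and the masked video that an inpainting model would receive. The function names bbox_mask, dilated_mask, and mask_video, the pad and radius parameters, and the choice of dilation are all illustrative assumptions; the abstract does not specify how the masks are constructed.

```python
import numpy as np


def bbox_mask(seg: np.ndarray, pad: int = 0) -> np.ndarray:
    """Regular mask (assumed form): padded bounding box of a binary silhouette."""
    mask = np.zeros(seg.shape, dtype=bool)
    ys, xs = np.nonzero(seg)
    if ys.size == 0:  # no person in this frame: nothing to mask
        return mask
    h, w = seg.shape
    y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad + 1, h)
    x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad + 1, w)
    mask[y0:y1, x0:x1] = True
    return mask


def dilated_mask(seg: np.ndarray, radius: int = 1) -> np.ndarray:
    """Irregular mask (assumed form): silhouette grown by `radius` pixels.

    Uses a (2*radius+1) x (2*radius+1) square neighborhood, implemented as
    the union of shifted copies of the zero-padded silhouette.
    """
    seg = seg.astype(bool)
    h, w = seg.shape
    padded = np.pad(seg, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def mask_video(video: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Zero out the masked region of a video.

    video: (T, H, W, C) array; masks: (T, H, W) boolean array, True = inpaint.
    """
    return video * (~masks)[..., None]


# Toy example: an L-shaped "person" in an 8x8 frame, repeated over 2 frames.
seg = np.zeros((8, 8), dtype=bool)
seg[2:5, 3] = True
seg[4, 4] = True

regular = bbox_mask(seg)             # hides the exact silhouette shape
irregular = dilated_mask(seg, 1)     # follows the silhouette loosely

video = np.ones((2, 8, 8, 3), dtype=np.float32)
masked = mask_video(video, np.stack([regular, regular]))
```

A box-shaped mask hides the outline of the original person, which is one intuitive way to reduce shape leakage, while a looser silhouette-based mask keeps more of the background visible. How the framework mixes these forms during training is not stated in this section.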
