VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao

TL;DR

VideoAnydoor inserts a reference object into a video with high appearance fidelity and precise motion control by combining a diffusion-based inpainting backbone with an ID extractor, a pixel warper, and trajectory-guided control. The method fuses identity and motion signals through cross-attention and a ControlNet, and trains on a mix of video and image data with a region-focused loss to improve fine-grained alignment. Experiments show better ID preservation, motion consistency, and user-perceived quality than existing methods, and the model supports applications such as video virtual try-on and multi-region editing without task-specific fine-tuning. The result is a zero-shot approach to content and motion editing in videos, with uses in editing, synthesis, and media production.

Abstract

Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object while accurately modeling coherent motion. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance while also supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and letting users manipulate the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a weighted loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
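
The abstract mentions a weighted loss without giving its form. As a rough illustration only, the sketch below shows one common way to weight a reconstruction loss toward a region of interest: a mean squared error in which pixels inside a mask count more. The function name `weighted_mse`, the `region_weight` parameter, and the choice of a normalized MSE are all assumptions made for this example and are not taken from the paper.

```python
def weighted_mse(pred, target, mask, region_weight=2.0):
    """Mean squared error that weights masked pixels more heavily.

    pred, target, mask: equally sized 2-D lists; mask entries are 0 or 1.
    region_weight: hypothetical weight for pixels inside the mask
    (pixels outside the mask get weight 1.0).
    """
    total = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = region_weight if m else 1.0
            total += w * (p - t) ** 2
            weight_sum += w
    # Normalize by the total weight so the loss scale does not depend
    # on how large the masked region is.
    return total / weight_sum
```

With this weighting, the same pixel error costs more when it falls inside the masked region than outside it, which is the general idea behind a region-focused loss.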

Paper Structure (33 sections, 2 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Demonstrations of video object insertion. VideoAnydoor preserves fine-grained object details and lets users control the motion with boxes or point trajectories. Building on this robust insertion, users can further add multiple objects iteratively or swap objects in the same video. Compared with previous work, VideoAnydoor demonstrates significant superiority.
  • Figure 2: The pipeline of VideoAnydoor. First, the concatenation of the original video, object masks, and masked video is fed into the 3D U-Net. Meanwhile, the background-removed reference image is fed into the ID extractor, and the resulting features are injected into the 3D U-Net. In the pixel warper, the reference image marked with key points and the trajectories serve as inputs to the content and motion encoders. The extracted embeddings are then fused through cross-attention. The fused results serve as the input of a ControlNet, which extracts multi-scale features for fine-grained injection of motion and identity. The framework is trained with weighted losses. A blend of real videos and image-simulated videos is used for training to compensate for data scarcity.
  • Figure 3: Pipeline of trajectory generation for training data. We first perform NMS to filter out densely-distributed points and then select points with larger motion. The retained ones can be sparsely distributed in each part of the target and contain more motion information, thus inducing more precise control.
  • Figure 4: Comparison between VideoAnydoor and existing state-of-the-art video editing methods. VideoAnydoor achieves superior performance in precise control of both motion and content.
  • Figure 5: Demonstrations for precise motion control. VideoAnydoor can achieve precise alignment with the given trajectories and objects when using a pair of reference images marked with key-points and corresponding trajectory maps as input.
  • ...and 3 more figures