MTV-Inpaint: Multi-Task Long Video Inpainting

Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao

TL;DR

MTV-Inpaint is a unified, controllable framework for inpainting long videos. A dual-branch spatial attention mechanism in a text-to-video (T2V) diffusion U-Net lets a single model handle both object insertion and scene completion. An image-to-video (I2V) inpainting mode adds controllability by accepting a first frame produced by an image inpainting model, and a two-stage pipeline (keyframe inpainting followed by in-between frame propagation) extends the method to videos with hundreds of frames. The authors report state-of-the-art results on both object insertion and scene completion, and demonstrate derived applications such as object editing, object removal, and multimodal inpainting.

Abstract

Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: https://mtv-inpaint.github.io/.
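The two-stage idea in the abstract (inpaint keyframes first, then fill the frames between them) can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function name `plan_two_stage`, the uniform `stride`, and the rule of always including the last frame as a keyframe are assumptions made here for illustration only.

```python
from typing import List, Tuple


def plan_two_stage(num_frames: int, stride: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Hypothetical schedule for a keyframe-then-in-between pipeline.

    Returns the keyframe indices (stage 1) and the pairs of consecutive
    keyframes that bound at least one in-between frame (stage 2).
    Uniform sampling with a fixed stride is an assumption, not the paper's rule.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")

    # Stage 1: pick every `stride`-th frame as a keyframe.
    keyframes = list(range(0, num_frames, stride))

    # Assumption: the final frame is always a keyframe so every
    # in-between segment is bounded on both sides.
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)

    # Stage 2: each pair of neighbouring keyframes with a gap between
    # them defines one segment whose interior frames must be filled.
    segments = [(a, b) for a, b in zip(keyframes, keyframes[1:]) if b - a > 1]
    return keyframes, segments


if __name__ == "__main__":
    keys, segs = plan_two_stage(num_frames=10, stride=4)
    print("keyframes:", keys)
    print("in-between segments:", segs)
```

In a real pipeline each segment would be passed to the model together with its two bounding keyframes; here the sketch only produces the index plan.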

Paper Structure

This paper contains 42 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Our VideoPaint framework. During training, we train the object insertion and scene completion tasks with a dual-branch U-Net, using object-aware masks and random masks respectively. Concurrently, we employ three frame masking modes: text-to-video (T2V), image-to-video (I2V), and keyframe-to-video (K2V). During inference, our method can perform basic T2V inpainting, or I2V inpainting when the first frame is obtained from a third-party image inpainting tool. To handle longer videos, we first use T2V/I2V mode to inpaint keyframes, then use K2V mode to inpaint the remaining in-between frames.
  • Figure 2: Qualitative comparison for object insertion evaluation. We recommend watching our supplementary video for dynamic results. Methods marked with an asterisk are not existing works; we implemented them ourselves.
  • Figure 3: Qualitative comparison for scene completion evaluation. We recommend watching our supplementary video for dynamic results.
  • Figure 4: User study results of different methods on (a) object insertion task, and (b) scene completion task.
  • Figure 5: Visual examples from different long video generation strategies.
  • ...and 8 more figures
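
The three frame masking modes named in the Figure 1 caption (T2V, I2V, K2V) can be pictured as per-frame flags that mark which frames are already given and which must be generated. The sketch below is an illustration only: the function name `frame_known_flags` is hypothetical, and treating K2V as "first and last frame of a clip are known" is an assumption, since the caption does not state exactly which frames are provided.

```python
from typing import List


def frame_known_flags(mode: str, num_frames: int) -> List[bool]:
    """Return one flag per frame: True if the frame is given, False if it must be inpainted.

    Hypothetical helper. The K2V layout (both end frames known) is an
    assumption made for illustration.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")

    known = [False] * num_frames
    if mode == "T2V":
        # Text-to-video: no frame is given, every frame is inpainted.
        pass
    elif mode == "I2V":
        # Image-to-video: the first frame comes from an image inpainting tool.
        known[0] = True
    elif mode == "K2V":
        # Keyframe-to-video (assumed layout): both bounding keyframes are given.
        known[0] = True
        known[-1] = True
    else:
        raise ValueError(f"unknown mode: {mode}")
    return known


if __name__ == "__main__":
    for mode in ("T2V", "I2V", "K2V"):
        print(mode, frame_known_flags(mode, 4))
```

Under these assumptions, a long video would be handled by running T2V or I2V on the keyframes and then K2V on each clip bounded by two inpainted keyframes.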