VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

Yuxuan Bian, Zhaoyang Zhang, Xuan Ju, Mingdeng Cao, Liangbin Xie, Ying Shan, Qiang Xu

TL;DR

VideoPainter tackles the long-standing challenge of any-length video inpainting and editing by decoupling background preservation from foreground generation through a dual-branch diffusion Transformer framework. A lightweight context encoder provides backbone-aware background cues to frozen pre-trained video DiTs, enabling plug-and-play control across backbones and text prompts. An inpainting region ID resampling technique ensures identity consistency over long videos, while a scalable VPData/VPBench pipeline yields over 390K clips with precise masks and dense captions. Empirical results demonstrate state-of-the-art performance across eight metrics for both inpainting and editing, highlighting practical impact for video editing workflows and large-scale generative evaluation.

Abstract

Video inpainting, which aims to restore corrupted video content, has experienced substantial progress. Despite these advances, existing methods, whether propagating unmasked region pixels through optical flow and receptive field priors, or extending image-inpainting models temporally, face challenges in generating fully masked objects or balancing the competing objectives of background context preservation and foreground generation in one model, respectively. To address these limitations, we propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. Extensive experiments demonstrate VideoPainter's superior performance in both any-length video inpainting and editing, across eight key metrics, including video quality, mask region preservation, and textual coherence.
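The Figure 3 caption describes the ID resampling step at inference time as concatenating the ID tokens of the last clip to the current KV vectors. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: the single-head attention, the clip dictionary layout (`q`, `k`, `v`, `masked`), and the function name `inpaint_clips` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)]
        )
    return outputs

def inpaint_clips(clips):
    """Hypothetical inference loop over consecutive clips of a long video.

    Assumption: the "ID tokens" of a clip are the K/V vectors of its
    masked-region tokens; they are appended to the next clip's K/V.
    """
    prev_id_keys, prev_id_values = [], []
    outputs = []
    for clip in clips:
        # Current-clip tokens plus ID tokens carried over from the last clip.
        keys = clip["k"] + prev_id_keys
        values = clip["v"] + prev_id_values
        outputs.append(attention(clip["q"], keys, values))
        # Keep only masked-region tokens as ID tokens for the next clip.
        prev_id_keys = [k for k, m in zip(clip["k"], clip["masked"]) if m]
        prev_id_values = [v for v, m in zip(clip["v"], clip["masked"]) if m]
    return outputs
```

In this sketch the first clip attends only to its own tokens, and every later clip also attends to the masked-region tokens of its predecessor, which is one way the inpainted identity could be kept consistent across an arbitrary number of clips.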

Paper Structure

This paper contains 34 sections, 1 equation, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Framework Comparison. Non-generative approaches, limited to pixel propagation from backgrounds, fail to inpaint fully segmentation-masked objects. Generative methods adapt single-branch image inpainting models to video by adding temporal attention, and struggle to both maintain background fidelity and generate foreground content in one model. In contrast, VideoPainter implements a dual-branch architecture that pairs an efficient context encoder with any pre-trained DiT, decoupling video inpainting into background preservation and foreground generation, and enabling plug-and-play video inpainting control.
  • Figure 2: Dataset Construction Pipeline. It consists of five pre-processing steps: collection, annotation, splitting, selection, and captioning.
  • Figure 3: Model overview. The upper figure shows the architecture of VideoPainter. The context encoder performs video inpainting based on concatenation of the noisy latent, downsampled masks, and masked video latent via VAE. Features extracted by the context encoder are integrated into the pre-trained DiT in a group-wise and token-selective manner, where two encoder layers modulate the first and second halves of the DiT, respectively, and only the background tokens are integrated into the backbone to prevent information ambiguity. The lower figure illustrates the inpainting region ID resampling with the ID Resample Adapter. During training, tokens of the current masked region are concatenated to the KV vectors, enhancing ID preservation of the inpainting region. During inference, the ID tokens of the last clip are concatenated to the current KV vectors, maintaining ID consistency with the last clip by resampling.
  • Figure 4: Comparison of previous inpainting methods and VideoPainter on standard and long video inpainting. More visualizations are in the demo video.
  • Figure 5: Comparison of previous editing methods and VideoPainter on standard and long video editing. More visualizations are in the demo video.
  • ...and 5 more figures
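
The group-wise, token-selective integration described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch only: the caption does not say how the features are combined, so the element-wise addition is an assumption, and the function names `integrate_context` and `layer_to_encoder_group` are hypothetical.

```python
def integrate_context(hidden, context_feats, is_background):
    """Token-selective integration of context-encoder features.

    Assumption: features are combined by element-wise addition.
    Only tokens flagged as background receive context features;
    masked (foreground) tokens pass through unchanged.
    """
    return [
        [h + c for h, c in zip(token, ctx)] if bg else list(token)
        for token, ctx, bg in zip(hidden, context_feats, is_background)
    ]

def layer_to_encoder_group(layer_idx, num_backbone_layers, num_encoder_layers=2):
    """Group-wise mapping from a backbone layer to an encoder layer.

    With two encoder layers, the first modulates the first half of the
    backbone layers and the second modulates the second half.
    """
    group_size = max(1, num_backbone_layers // num_encoder_layers)
    return min(layer_idx // group_size, num_encoder_layers - 1)
```

For example, with a hypothetical 8-layer backbone, layers 0 to 3 would take features from encoder layer 0 and layers 4 to 7 from encoder layer 1, while any token inside the mask is left untouched by `integrate_context`.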