DiTPainter: Efficient Video Inpainting with Diffusion Transformers
Xian Wu, Chang Liu
TL;DR
The paper tackles video inpainting by addressing the inefficiency of large pretrained diffusion models and the temporal inconsistencies of flow-based propagation. It introduces DiTPainter, a compact Diffusion Transformer trained from scratch, paired with a 3D WF-VAE encoder to operate in a latent space. Key innovations include Flow Matching as the diffusion scheduler, MultiDiffusion for temporal coherence across arbitrary-length videos, and a two-stage coarse-to-fine training strategy. Empirical results show competitive or superior visual quality and temporal consistency with reduced compute, enabling practical tasks such as video decaptioning and completion.
Abstract
Many existing video inpainting algorithms use optical flow to construct correspondence maps and then propagate pixels from adjacent frames into the missing regions. Although this propagation mechanism is effective, it can produce blurry or inconsistent results when the optical flow is inaccurate or the masks are large. Recently, the Diffusion Transformer (DiT) has emerged as a revolutionary technique for video generation. However, pretrained DiT models for video generation all contain a large number of parameters, which makes them very time-consuming to apply to video inpainting. In this paper, we present DiTPainter, an end-to-end video inpainting model based on the Diffusion Transformer. DiTPainter uses an efficient transformer network designed for video inpainting, which is trained from scratch rather than initialized from a large pretrained model. DiTPainter can handle videos of arbitrary length and can be applied to video decaptioning and video completion at an acceptable time cost. Experiments show that DiTPainter outperforms existing video inpainting algorithms, with higher quality and better spatial-temporal consistency.
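
To illustrate how a fixed-size model can process videos of arbitrary length, the sketch below splits a latent sequence into overlapping temporal windows, applies one denoising step to each window, and averages the predictions wherever windows overlap, in the spirit of MultiDiffusion. It is a minimal sketch and not the authors' implementation: the function names, the window and stride values, the latent layout (frames first), and the plain uniform averaging are all assumptions made for illustration.

```python
import numpy as np

def window_starts(num_frames, window, stride):
    """Start indices of temporal windows that together cover every frame."""
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        # Add a final window flush with the end so no trailing frame is missed.
        starts.append(num_frames - window)
    return starts

def fuse_windows(latents, denoise_fn, window=16, stride=8):
    """Run denoise_fn on overlapping windows and average overlapping frames.

    latents:    array of shape (frames, ...), an assumed latent layout.
    denoise_fn: hypothetical stand-in for one denoising step of the model;
                it must return an array with the same shape as its input.
    """
    num_frames = latents.shape[0]
    acc = np.zeros_like(latents, dtype=np.float64)
    # Per-frame count of how many windows contributed, broadcastable to latents.
    count = np.zeros((num_frames,) + (1,) * (latents.ndim - 1))
    for start in window_starts(num_frames, window, stride):
        end = min(start + window, num_frames)
        acc[start:end] += denoise_fn(latents[start:end])
        count[start:end] += 1
    return acc / count
```

In a full sampler this fusion would run once per denoising step, so neighbouring windows agree on their shared frames throughout sampling rather than being stitched together only at the end.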
