DiTPainter: Efficient Video Inpainting with Diffusion Transformers

Xian Wu, Chang Liu

TL;DR

The paper tackles video inpainting by addressing the inefficiency of large pretrained diffusion models and the temporal inconsistencies of flow-based propagation. It introduces DiTPainter, a compact Diffusion Transformer trained from scratch, paired with a 3D WF-VAE encoder to operate in a latent space. Key innovations include Flow Matching as the diffusion scheduler, MultiDiffusion for temporal coherence across arbitrary-length videos, and a two-stage coarse-to-fine training strategy. Empirical results show competitive or superior visual quality and temporal consistency with reduced compute, enabling practical tasks such as video decaptioning and completion.
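To make the arbitrary-length idea concrete, here is a minimal sketch of MultiDiffusion-style temporal fusion: the video latents are split into overlapping windows, each window is denoised independently, and the predictions are averaged where windows overlap. The function names, the (T, C, H, W) latent layout, the window size, the stride, and the uniform averaging are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def temporal_windows(num_frames, window, stride):
    """Return (start, end) index pairs that cover every frame with overlap."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window flush with the end so no frame is left out.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

def fused_denoise_step(latents, denoise_fn, window=16, stride=8):
    """Run denoise_fn on each window and average predictions on overlaps.

    latents: array of shape (T, C, H, W) (assumed layout).
    denoise_fn: hypothetical stand-in for one model denoising step.
    """
    total = np.zeros_like(latents, dtype=np.float64)
    # Per-frame counter, broadcastable against the latent array.
    count = np.zeros((latents.shape[0],) + (1,) * (latents.ndim - 1))
    for start, end in temporal_windows(latents.shape[0], window, stride):
        total[start:end] += denoise_fn(latents[start:end])
        count[start:end] += 1
    return total / count
```

In a full sampler this fusion would be repeated at every diffusion step, so that neighbouring windows stay in agreement throughout denoising rather than being stitched together only at the end.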

Abstract

Many existing video inpainting algorithms use optical flow to construct correspondence maps and then propagate pixels from adjacent frames into the missing areas. Although this propagation mechanism is effective, it can produce blurry and inconsistent results when the optical flow is inaccurate or the masks are large. Recently, the Diffusion Transformer (DiT) has emerged as a revolutionary technique for video generation. However, pretrained DiT models for video generation all contain a large number of parameters, which makes them very time-consuming to apply to video inpainting. In this paper, we present DiTPainter, an end-to-end video inpainting model based on the Diffusion Transformer (DiT). DiTPainter uses an efficient transformer network designed for video inpainting, which is trained from scratch rather than initialized from a large pretrained model. DiTPainter can handle videos of arbitrary length and can be applied to video decaptioning and video completion at an acceptable time cost. Experiments show that DiTPainter outperforms existing video inpainting algorithms, with higher quality and better spatial-temporal consistency.

Paper Structure

This paper contains 4 sections, 4 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Pipeline of our method. Masked frames are first encoded into 3D latents, and the corresponding masks are downsampled to the same size. We patchify the video latents, the masks, and random noise, and add them together as a sequence of tokens. After the diffusion process is carried out through several transformer blocks, the tokens are decoded into video frames as the final result.
  • Figure 2: Structure of our transformer block.
  • Figure 3: Results of video completion by ProPainter and our method.
  • Figure 4: Qualitative comparison of video decaptioning between ProPainter and our method.
  • Figure 5: Qualitative comparison of video decaptioning between ProPainter and our method.
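
The token construction described in the Figure 1 caption can be sketched roughly as follows: latents, downsampled masks, and noise are each cut into patches, projected to a common width, and summed into one token sequence. This is only an illustration under stated assumptions. The (T, C, H, W) layout, the purely spatial patch size of 2, the separate linear projection per input, and the names `patchify` and `build_tokens` are hypothetical and not confirmed by the paper.

```python
import numpy as np

def patchify(x, patch=2):
    """Flatten a (T, C, H, W) array into patches of shape (T*H/p*W/p, C*p*p)."""
    t, c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    x = x.reshape(t, c, h // patch, patch, w // patch, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # -> (T, H/p, W/p, C, p, p)
    return x.reshape(t * (h // patch) * (w // patch), c * patch * patch)

def build_tokens(latents, mask, noise, weights, patch=2):
    """Project each input with its own matrix (assumed) and sum the results."""
    parts = {"latent": latents, "mask": mask, "noise": noise}
    return sum(patchify(parts[name], patch) @ weights[name] for name in parts)

# Example with made-up sizes: 4 latent frames, 8 channels, 16x16 latents.
rng = np.random.default_rng(0)
T, C, H, W, dim = 4, 8, 16, 16, 64
latents = rng.standard_normal((T, C, H, W))
mask = np.zeros((T, 1, H, W))            # mask already downsampled to latent size
noise = rng.standard_normal((T, C, H, W))
weights = {
    "latent": rng.standard_normal((C * 4, dim)),
    "mask": rng.standard_normal((1 * 4, dim)),
    "noise": rng.standard_normal((C * 4, dim)),
}
tokens = build_tokens(latents, mask, noise, weights)
```

Summing the three embeddings keeps the sequence length equal to the number of patches, whereas concatenating them along the sequence axis would triple the attention cost.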