DiffuEraser: A Diffusion Model for Video Inpainting

Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo

TL;DR

DiffuEraser tackles video inpainting by combining diffusion-based generation with prior-informed initialization and temporal strategies. The model builds on a BrushNet branch with temporal attention and uses Propainter-based priors, along with pre-inference to extend the temporal receptive field and the temporal smoothing property of Video Diffusion Models (VDMs), achieving better detail and consistency than prior state-of-the-art methods. Experimental results on Panda-70M demonstrate improved content completeness and temporal coherence with competitive efficiency. This approach reduces the hallucinations typical of diffusion models in video inpainting and offers a framework applicable to other long-sequence video editing tasks.

Abstract

Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
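
The prior-based initialization mentioned in the abstract can be illustrated with a toy sketch. The Figure 4 caption says the prior model's output goes through DDIM inversion and is added to the noisy latent; everything else here is an assumption made for illustration, not the authors' implementation. In particular, the linear beta schedule, the function names, the `eps_model` callable standing in for the denoising UNet, and the `prior_weight` blending factor are all hypothetical.

```python
import numpy as np


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def ddim_move(x, eps, a_from, a_to):
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps


def ddim_invert(x0, eps_model, alpha_bar):
    """DDIM inversion: walk a clean latent up the noise schedule."""
    x, a_prev = x0, 1.0  # alpha_bar = 1 corresponds to a clean latent
    for t, a_t in enumerate(alpha_bar):
        x = ddim_move(x, eps_model(x, t), a_prev, a_t)
        a_prev = a_t
    return x


def ddim_sample(x_T, eps_model, alpha_bar):
    """DDIM sampling: walk a noisy latent back down the schedule."""
    x = x_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        x = ddim_move(x, eps_model(x, t), alpha_bar[t], a_prev)
    return x


def init_latent_with_prior(prior_latent, eps_model, alpha_bar, rng, prior_weight=0.5):
    """Hypothetical blend: random noise plus the weighted, inverted prior latent."""
    inverted = ddim_invert(prior_latent, eps_model, alpha_bar)
    noise = rng.standard_normal(prior_latent.shape)
    return noise + prior_weight * inverted
```

In this sketch, `prior_latent` would be the encoded output of the prior model (Propainter in the paper), and the returned array would replace pure Gaussian noise as the starting point for denoising. How strongly the prior should be weighted is not stated in this excerpt, so `prior_weight` is only a placeholder.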

Paper Structure (10 sections, 13 figures)

Figures (13)

  • Figure 1: Performance comparison between the proposed model, DiffuEraser, and Propainter. (a) Texture Quality: DiffuEraser generates more detailed and refined textures compared to the transformer-based Propainter. (b) Temporal Consistency: DiffuEraser demonstrates superior temporal consistency in the inpainted content compared to Propainter.
  • Figure 2: Overview of the proposed video inpainting model DiffuEraser, based on stable diffusion. The main denoising UNet performs the denoising process to generate the final output. The BrushNet branch extracts features from masked images, which are added to the main denoising UNet layer by layer after a zero convolution block. Temporal attention is incorporated after self-attention and cross-attention to improve temporal consistency.
  • Figure 3: Example of noisy artifacts generated by the model. The masked region above the sea level is not completed correctly and resembles random noise.
  • Figure 4: Incorporation of priors. We introduce priors during inference by performing DDIM inversion on the outputs of the prior model and adding them to the noisy latent.
  • Figure 5: Comparison of inpainting results before and after incorporating priors.
  • ...and 8 more figures
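
The layer-by-layer feature injection described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the class `ZeroConv1x1`, the function `inject_branch_features`, and the use of a 1x1 convolution over `(channels, height, width)` arrays are assumptions. The zero initialization means the branch contributes nothing at the start of training, so the main UNet's behavior is initially unchanged.

```python
import numpy as np


class ZeroConv1x1:
    """1x1 convolution whose weight and bias start at zero (hypothetical helper)."""

    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))
        self.bias = np.zeros(channels)

    def __call__(self, x):
        # x has shape (channels, height, width); a 1x1 convolution mixes
        # channels independently at every spatial position.
        mixed = np.einsum("oc,chw->ohw", self.weight, x)
        return mixed + self.bias[:, None, None]


def inject_branch_features(unet_features, branch_features, zero_convs):
    """Add each zero-conv-projected branch feature map to the matching UNet map."""
    return [
        unet_map + zero_conv(branch_map)
        for unet_map, branch_map, zero_conv in zip(
            unet_features, branch_features, zero_convs
        )
    ]
```

With freshly initialized zero convolutions the injected features equal the UNet features exactly. As the weights move away from zero during training, the branch's masked-image features gradually influence each layer.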