DiffuEraser: A Diffusion Model for Video Inpainting
Xiaowen Li, Haolan Xue, Peiran Ren, Liefeng Bo
TL;DR
DiffuEraser tackles video inpainting by combining diffusion-based generation with prior-informed initialization and temporal strategies. The model builds on a BrushNet-style architecture with a motion module and uses ProPainter-based priors, together with pre-inference and video diffusion model (VDM) temporal smoothing to extend the temporal receptive field, achieving better detail and consistency than state-of-the-art methods. Experimental results on Panda-70M demonstrate improved content completeness and temporal coherence with competitive efficiency. The approach reduces the hallucinations typical of diffusion models in video inpainting and offers a framework applicable to other long-sequence video editing tasks.
Abstract
Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation: optical flow restores textures and objects using information from neighboring frames, while visual transformers complete the masked regions. However, these approaches often produce blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with stronger generative capabilities. Diffusion models have recently emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on Stable Diffusion, designed to fill masked regions with greater detail and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.
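To make the idea of prior-based initialization concrete, the following is a minimal sketch and not the paper's implementation. It assumes the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and treats the prior model's output as x_0, so that sampling starts from a noised prior rather than from pure Gaussian noise. The function name `noise_prior_latent`, the flattened list representation of the latent, and the chosen value of `alpha_bar_t` are all illustrative assumptions.

```python
import math
import random


def noise_prior_latent(prior_latent, alpha_bar_t, rng):
    """Forward-diffuse a prior latent to the noise level given by alpha_bar_t.

    prior_latent: flat list of floats (hypothetical flattened latent of the prior result).
    alpha_bar_t:  cumulative signal-retention coefficient in [0, 1]; 1 keeps the
                  prior unchanged, 0 replaces it with pure Gaussian noise.
    rng:          a random.Random instance, passed in for reproducibility.
    """
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in prior_latent]


rng = random.Random(0)
prior = [0.5] * 16  # stand-in for a flattened prior latent (assumption)
x_t = noise_prior_latent(prior, alpha_bar_t=0.05, rng=rng)
```

With a small `alpha_bar_t` the starting latent is dominated by noise but still carries a weak trace of the prior, which is one plausible reading of how a prior can steer generation without fully constraining it.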
