Video Diffusion Models are Strong Video Inpainter

Minhyeok Lee, Suhwan Cho, Chajin Shin, Jungho Lee, Sunghun Yang, Sangyoun Lee

TL;DR

This paper tackles video inpainting by addressing the fragility of flow-propagation methods and the hallucination risks of diffusion-based approaches. It introduces First Frame Filling Video Diffusion Inpainting (FFF-VDI), which propagates latent information from future frames to fill the first frame's latent and then fine-tunes a pre-trained image-to-video diffusion model to generate the remaining frames. Key innovations include the First Frame Filling (FFF) module with latent propagation, a Deformable Noise Alignment (DNA) module for temporal stabilization, and the use of DDIM inversion during inference to curb hallucinations. Empirical results on YouTube-VOS and DAVIS show that FFF-VDI delivers superior perceptual quality and temporal consistency, especially under large or rough masks, while reducing dependency on optical-flow accuracy.
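The summary mentions DDIM inversion without giving its update rule. As a rough illustration only, the textbook deterministic DDIM update can be sketched as below. The function name `ddim_step`, the argument names, and passing the noise prediction `eps` in directly (rather than calling a denoising network) are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_target):
    """One deterministic DDIM update from noise level alpha_bar_t to alpha_bar_target.

    x_t              : current latent (any array shape)
    eps              : noise prediction for x_t (assumed given; normally a network output)
    alpha_bar_t      : cumulative alpha at the current timestep, in (0, 1]
    alpha_bar_target : cumulative alpha at the target timestep, in (0, 1]

    A smaller target alpha_bar adds noise (inversion); a larger one denoises (sampling).
    """
    # Estimate the clean latent implied by the current latent and noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the target noise level using the same noise direction.
    return np.sqrt(alpha_bar_target) * x0_pred + np.sqrt(1.0 - alpha_bar_target) * eps
```

With the same `eps` in both directions, inverting and then sampling returns the starting latent. That is the property that makes inversion useful for anchoring generation to the observed video rather than to freshly sampled noise. In practice the network's prediction differs between the two steps, so the round trip is only approximate.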

Abstract

Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and temporal consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). FFF-VDI is inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. The proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.
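The idea of filling the first frame's masked latent from later frames can be illustrated with a deliberately simplified sketch. It assumes the frames are already spatially aligned, so a latent value can be copied from the same position in a later frame. The actual FFF module also applies deformable convolution, and none of that is modelled here. The function name `fill_first_frame` and the array layout are hypothetical.

```python
import numpy as np

def fill_first_frame(latents, masks):
    """Fill masked positions of the first frame's latent from later frames.

    latents : array of shape (T, C, H, W), one latent per frame
    masks   : array of shape (T, H, W), 1 where the latent is masked (unknown)

    Returns (filled, remaining): the filled first-frame latent of shape (C, H, W)
    and a boolean (H, W) map of positions that no later frame could supply.
    Simplifying assumption: frames are pre-aligned, so no warping is performed.
    """
    masks = masks.astype(bool)
    filled = latents[0].copy()
    remaining = masks[0].copy()
    for t in range(1, latents.shape[0]):
        # Positions still missing in frame 0 that are visible in frame t.
        usable = remaining & ~masks[t]
        filled[:, usable] = latents[t][:, usable]
        remaining &= ~usable
        if not remaining.any():
            break
    return filled, remaining
```

Positions still flagged in `remaining` are never visible in any frame and would have to be synthesized by the generative model. The nearest later frame is used first, which is one possible design choice among several.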

Paper Structure

This paper contains 9 sections, 5 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Inpainting results of the proposed FFF-VDI and the flow propagation-based ProPainter. When the target object is frequently occluded or structurally difficult to track, large and rough bounding box masks are advantageous for editing.
  • Figure 2: The overall training and testing pipeline structure of our FFF-VDI.
  • Figure 3: The structure of the proposed FFF module. First, the FFF module propagates the latent noise information from each frame to the first frame's latent noise to fill the masked areas. Next, deformable convolution is applied to reconstruct the latent-level distortions and structural information.
  • Figure 4: Qualitative comparisons on both video completion and object removal. The proposed FFF-VDI demonstrates robust video inpainting performance in masked areas compared to the existing flow propagation-based model, ProPainter.
  • Figure 5: Visualization results with and without the proposed FFF module. "Per-frame diffusion" denotes the result of inpainting each frame independently with Stable Diffusion.
  • ...and 2 more figures