BVINet: Unlocking Blind Video Inpainting with Zero Annotations

Zhiliang Wu, Kerui Chen, Kun Li, Hehe Fan, Yi Yang

TL;DR

BVINet tackles blind video inpainting by eliminating the need for corrupted-region masks and jointly learning where to inpaint and how to inpaint. It consists of a Mask Prediction Network (MPNet) and a Video Completion Network (VCNet) connected through a consistency loss that enforces mutual refinement, enabling accurate localization and realistic content filling. MPNet combines short-term prediction with a long-term transformer for temporal coherence, while VCNet uses a Wavelet Sparse Transformer with Discrete Wavelet Transform to perform frequency-aware, noise-robust inpainting that restricts attention to valid regions. A customized dataset with free-form stroke corruptions and bullet-removal clips supports robust evaluation, and experiments show state-of-the-art performance in blind settings with competitive results versus non-blind methods, confirmed by thorough ablations. This work advances practical blind video restoration by removing annotation bottlenecks and enabling scalable, annotation-free video inpainting.
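To make the frequency-aware part of the summary concrete, the sketch below shows what a one-level 2D Discrete Wavelet Transform does to a single-channel frame, using the Haar wavelet. The choice of Haar, the function name `haar_dwt2`, and the sub-band names are assumptions for illustration; the summary does not say which wavelet VCNet uses or how the sub-bands feed the transformer.

```python
def haar_dwt2(frame):
    """One-level 2D Haar DWT of a 2D list with even height and width.

    Illustrative only: the Haar wavelet is an assumption, not a detail
    taken from the paper. Returns four half-resolution sub-bands.
    """
    height, width = len(frame), len(frame[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")

    low, row_detail, col_detail, diag_detail = [], [], [], []
    for i in range(0, height, 2):
        low_row, rd_row, cd_row, dd_row = [], [], [], []
        for j in range(0, width, 2):
            # Each output coefficient summarizes one 2x2 block of pixels.
            a, b = frame[i][j], frame[i][j + 1]
            c, d = frame[i + 1][j], frame[i + 1][j + 1]
            low_row.append((a + b + c + d) / 2)   # low-frequency average
            rd_row.append((a + b - c - d) / 2)    # change between the two rows
            cd_row.append((a - b + c - d) / 2)    # change between the two columns
            dd_row.append((a - b - c + d) / 2)    # diagonal change
        low.append(low_row)
        row_detail.append(rd_row)
        col_detail.append(cd_row)
        diag_detail.append(dd_row)
    return low, row_detail, col_detail, diag_detail


if __name__ == "__main__":
    flat = [[1, 1], [1, 1]]
    print(haar_dwt2(flat))  # all detail sub-bands are zero for a flat block
```

With this orthonormal scaling the transform preserves signal energy, so smooth content concentrates in the low-frequency band while sharp, localized changes show up in the detail bands. That separation is what makes a frequency-aware design plausible for distinguishing noise from structure.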

Abstract

Video inpainting aims to fill in corrupted regions of a video with plausible content. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on "how to inpaint". This assumption requires manual annotation of the corrupted regions with binary masks that indicate "where to inpaint". However, annotating these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we relax this assumption by defining a new blind video inpainting setting, which enables networks to learn the mapping from corrupted video to inpainted result directly and eliminates the need for corrupted-region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet predicts the masks of corrupted regions by detecting semantically discontinuous regions of the frame and exploiting the temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. In addition, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we customize a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
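The "where to inpaint" then "how to inpaint" data flow described in the abstract can be sketched as a two-stage function. Everything below is a hypothetical stand-in: `toy_predict_mask` and `toy_complete` are trivial placeholders for the learned mask prediction and video completion networks, `blind_inpaint` is an illustrative name, and frames are processed independently, whereas the paper's networks also exploit temporal information across frames.

```python
def toy_predict_mask(frame, marker=-1):
    """Toy stand-in for mask prediction: flag pixels equal to a marker value.

    A learned network would infer corruption from image content instead.
    """
    return [[1 if px == marker else 0 for px in row] for row in frame]


def toy_complete(frame, mask):
    """Toy stand-in for completion: fill flagged pixels with the mean of
    the valid (unflagged) pixels, leaving valid pixels untouched."""
    valid = [
        px
        for row, mask_row in zip(frame, mask)
        for px, m in zip(row, mask_row)
        if m == 0
    ]
    fill = sum(valid) / len(valid) if valid else 0.0
    return [
        [fill if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


def blind_inpaint(frames, predict_mask, complete):
    """Blind setting: only corrupted frames are given, no masks.

    Stage 1 predicts a mask per frame; stage 2 uses that mask to decide
    which pixels count as valid context and which must be filled in.
    """
    results = []
    for frame in frames:
        mask = predict_mask(frame)
        results.append((mask, complete(frame, mask)))
    return results


if __name__ == "__main__":
    video = [[[2, -1], [4, 6]]]  # one 2x2 frame with one corrupted pixel
    for mask, completed in blind_inpaint(video, toy_predict_mask, toy_complete):
        print(mask, completed)
```

The point of the sketch is the interface: the caller never supplies a mask, and the completion stage depends entirely on the quality of the predicted one, which is why the abstract couples the two stages through a consistency loss.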

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Fig.(a) shows the general pipeline of existing non-blind video inpainting. Such pipelines require manual annotation of the corrupted regions of each frame, limiting their application scope. In this paper, we formulate a new task, blind video inpainting, which directly learns a mapping from corrupted video to inpainted result without any corrupted-region annotation (Fig.(b)). Fig.(c) shows an example of our blind video inpainting method applied to scratch restoration and bullet removal.
  • Figure 2: Overview of the proposed blind video inpainting framework. Our framework is composed of a mask prediction network (MPNet) and a video completion network (VCNet). The former predicts the masks of corrupted regions by detecting semantically discontinuous regions of the frame and exploiting the temporal consistency prior of the video, while the latter uses the predicted mask to perceive valid context information from uncorrupted regions and fill in the corrupted content.
  • Figure 3: Three examples of inpainting results with our method. The top row shows the corrupted video frames. The completed results are shown in the bottom row, where the green boxes denote the masks generated by the model.
  • Figure 4: Qualitative results compared with OGNet [phutke2023blind] and RAVUNet [agnolucci2022restoration] on bullet removal.
  • Figure 5: Example of corrupted regions segmentation.
  • ...and 2 more figures