Deep Learning-based Image and Video Inpainting: A Survey

Weize Quan, Jiaxi Chen, Yanli Liu, Dong-Ming Yan, Peter Wonka

TL;DR

This survey reviews deep learning-based image and video inpainting. It presents a taxonomy of deterministic and stochastic approaches, covering single-shot, two-stage, and progressive image inpainting, as well as 3D CNN-based, shift-based, flow-guided, and attention-based video inpainting. It also covers architectures (CNNs, VAEs, GANs, transformers, diffusion models), training objectives, datasets, evaluation metrics, and key applications, with a critical discussion of strengths and tradeoffs. The authors highlight open challenges such as large-scale missing regions, uncertainty artifacts, high training costs, slow diffusion-based methods, and ethical considerations, and they propose directions such as diffusion-model-based inpainting and leveraging large-scale, cross-modal data. The work serves as a practical reference for researchers and practitioners who need to choose methodologies, benchmark methods, and deploy them in real-world contexts.

Abstract

Image and video inpainting is a classic problem in computer vision and computer graphics that aims to fill the missing areas of images and videos with plausible and realistic content. With the advance of deep learning, significant progress has recently been made on this problem. The goal of this paper is to comprehensively review deep learning-based methods for image and video inpainting. Specifically, we sort existing methods into categories according to their high-level inpainting pipeline, present different deep learning architectures, including CNN, VAE, GAN, and diffusion models, and summarize techniques for module design. We review the training objectives and the common benchmark datasets. We present evaluation metrics for low-level pixel similarity and high-level perceptual similarity, conduct a performance evaluation, and discuss the strengths and weaknesses of representative inpainting methods. We also discuss related real-world applications. Finally, we discuss open challenges and suggest potential future research directions.

Paper Structure

This paper contains 36 sections, 12 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: Application examples of inpainting techniques: photo restoration (top left: image from [bertalmio2000image]), text removal (top right: image from [bertalmio2000image]), undesired target removal (bottom left: image from [appstore]), and face verification (bottom right: image from [Zhang2018DeMeshNet]).
  • Figure 2: Approximate number of papers on image and video inpainting per year.
  • Figure 3: Representative pipeline of the single-shot inpainting framework. The generator takes as input the concatenation of a binary mask and a corrupted image and outputs the completed image. The generator is trained with the training objectives.
  • Figure 4: Two types of the two-stage inpainting framework: (a) coarse-to-fine [yu2018generative], where the first network predicts an initial coarse result and the second network predicts a refined result; (b) structure-then-texture [nazeri2019edgeconnect], where the first network predicts a structure map and the second network predicts a complete image. The main difference between the two types is that structure-then-texture methods explicitly predict the structure map in the first stage.
  • Figure 5: Progressive image inpainting. The image comes from [zhang2018semantic].
  • ...and 12 more figures
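
The Figure 3 caption describes the single-shot generator input as the concatenation of a binary mask and a corrupted image. The following is a minimal NumPy sketch of that input construction, not the survey's or any particular method's implementation. It assumes the convention that a mask value of 1 marks a missing pixel and that images are float arrays of shape (H, W, 3). The names `make_generator_input` and `mean_fill_generator` are hypothetical, and the mean-fill function is only a stand-in for a trained network.

```python
import numpy as np


def make_generator_input(image, mask):
    """Concatenate the corrupted image and the binary mask along channels.

    image: float array of shape (H, W, 3).
    mask:  array of shape (H, W); 1 marks a missing pixel (assumed convention).
    Returns an array of shape (H, W, 4).
    """
    m = mask.astype(image.dtype)[..., None]   # (H, W, 1)
    corrupted = image * (1.0 - m)             # zero out the missing region
    return np.concatenate([corrupted, m], axis=-1)


def mean_fill_generator(gen_input):
    """Hypothetical stand-in for a trained generator.

    Fills the missing region with the per-channel mean of the known pixels.
    """
    corrupted, m = gen_input[..., :3], gen_input[..., 3:]
    known = 1.0 - m
    # Per-channel mean over known pixels; guard against an all-missing image.
    mean_color = (corrupted * known).sum(axis=(0, 1)) / max(known.sum(), 1.0)
    return corrupted * known + mean_color * m


# Usage: a 4x4 gray image with a 2x2 hole in the center.
image = np.full((4, 4, 3), 0.5)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
gen_input = make_generator_input(image, mask)
completed = mean_fill_generator(gen_input)
```

Under the same assumptions, a coarse-to-fine two-stage pipeline in the sense of Figure 4(a) would apply a second network to the first network's output, for example by rebuilding a 4-channel input from `completed` and `mask` and passing it to a refinement stage.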