Image Fine-grained Inpainting

Zheng Hui, Jie Li, Xiumei Wang, Xinbo Gao

TL;DR

This paper tackles image inpainting, particularly restoring large missing regions with realistic global structure and fine-grained textures. It introduces a one-stage Dense Multi-Scale Fusion Network (DMFN) built from dense multi-scale fusion blocks (DMFB) and guided by a suite of novel losses: self-guided regression to focus on uncertain regions, geometrical alignment to preserve semantic localization, and discriminator feature matching within a RaGAN framework for local-global consistency. The method combines a global/local discriminator, VGG-based perceptual losses, and an optimized final objective to produce high-quality inpainted results across faces, buildings, and natural scenes, outperforming several state-of-the-art approaches on multiple datasets. Ablation studies confirm the contributions of DMFB, the self-guided regression loss, and the alignment constraint, demonstrating robust improvements in both qualitative appearance and quantitative metrics.

Abstract

Image inpainting techniques have recently improved considerably with the help of generative adversarial networks (GANs). However, most of them still produce completed results with unreasonable structure or blurriness. To mitigate this problem, we present a one-stage model that uses dense combinations of dilated convolutions to obtain larger and more effective receptive fields. Thanks to this property, the network can more easily recover large regions in an incomplete image. To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss that concentrates on uncertain areas and enhances semantic details. We also devise a geometrical alignment constraint term to complement the pixel-based distance between predicted features and ground-truth ones. A discriminator with local and global branches ensures local-global content consistency. To further improve the quality of generated images, we introduce discriminator feature matching on the local branch, which dynamically minimizes the discrepancy between intermediate features of synthetic and ground-truth patches. Extensive experiments on several public datasets demonstrate that our approach outperforms current state-of-the-art methods. Code is available at https://github.com/Zheng222/DMFN.
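
The link between dilation and receptive field mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the authors' code: the helper names `effective_kernel` and `stacked_receptive_field` are hypothetical, and the dilation rates 1, 2, 4, 8 are an assumption chosen for illustration (the Figure 2 caption names only "Conv-3-8" as an example). It relies on the standard identity that a $k \times k$ convolution with dilation $d$ spans $k + (k-1)(d-1)$ pixels per side.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Side length spanned by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of stride-1 convolutions applied one after another.

    `layers` is a sequence of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


# Assumed dilation rates, for illustration only.
rates = [1, 2, 4, 8]

# Span of a single 3x3 convolution at each rate.
per_branch = {d: effective_kernel(3, d) for d in rates}

# Receptive field if the same four convolutions were chained instead.
stacked = stacked_receptive_field([(3, d) for d in rates])

print(per_branch)
print(stacked)
```

Under these assumptions a single 3x3 convolution with dilation 8 already spans 17 pixels per side while keeping nine weights per channel pair, which is why combining several dilation rates is a cheap way to enlarge the receptive field.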

Paper Structure

This paper contains 23 sections, 11 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Inpainted results on the FFHQ dataset using our method. The missing areas are shown in white. Note that lighting and texture are also recovered well.
  • Figure 2: The architecture of the proposed dense multi-scale fusion block (DMFB). Here, "Conv-3-8" denotes a $3 \times 3$ convolution layer with a dilation rate of $8$, and $\oplus$ is element-wise summation. The instance normalization (IN) and ReLU activation layers that follow the first convolution, the second-column convolutions, and the concatenation layer are omitted for brevity. The last convolutional layer is followed only by an IN layer. Each convolution has $64$ output channels, except for the last $1 \times 1$ convolution in the DMFB, which has 256.
  • Figure 3: The framework of our method. The activation layer that follows each "convolution + norm" or convolution layer in the generator is omitted for conciseness. The activation function is ReLU, except for the last convolution in the generator, which uses Tanh. The blue dotted box indicates our upsampler module (TConv-4 is a $4 \times 4$ transposed convolution), and "$s2$" denotes a stride of 2.
  • Figure 4: Visualization of average VGG feature maps.
  • Figure 5: Visualization of guidance maps. (Left) Guidance map $\mathbf{M}_{guidance}^1$ for the "relu1_1" layer. (Right) Guidance map $\mathbf{M}_{guidance}^2$ for the "relu2_1" layer. These correspond to Figure 4.
  • ...and 11 more figures