Aggregated Contextual Transformations for High-Resolution Image Inpainting

Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

TL;DR

AOT-GAN is an enhanced GAN-based model for high-resolution image inpainting that outperforms the state of the art in context reasoning and texture synthesis. It is also evaluated in practical applications, e.g., logo removal, face editing, and object removal.

Abstract

State-of-the-art image inpainting approaches can generate distorted structures and blurry textures in high-resolution images (e.g., 512x512). The challenges mainly stem from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two challenges, we propose an enhanced GAN-based model, named Aggregated COntextual-Transformation GAN (AOT-GAN), for high-resolution image inpainting. Specifically, to enhance context reasoning, we construct the generator of AOT-GAN by stacking multiple layers of a proposed AOT block. The AOT blocks aggregate contextual transformations from various receptive fields, allowing the generator to capture both informative distant image contexts and rich patterns of interest for context reasoning. To improve texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task. This training objective forces the discriminator to distinguish the detailed appearances of real and synthesized patches, which in turn helps the generator synthesize clear textures. Extensive comparisons on Places2, the most challenging benchmark with 1.8 million high-resolution images of 365 complex scenes, show that our model outperforms the state of the art by a significant margin, with a 38.60% relative improvement in FID. A user study including more than 30 subjects further validates the superiority of AOT-GAN. We further evaluate the proposed AOT-GAN in practical applications, e.g., logo removal, face editing, and object removal. Results show that our model achieves promising completions in the real world. We release code and models at https://github.com/researchmm/AOT-GAN-for-Inpainting.
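To make "aggregating contextual transformations from various receptive fields" concrete, the sketch below computes how far a single dilated convolution kernel reaches and how a block's channels could be divided among parallel branches. The specific numbers (256 channels, 3x3 kernels, dilation rates 1, 2, 4, 8) are assumptions for illustration and are not stated in this summary; `effective_kernel_size` and `branch_plan` are hypothetical helper names, not functions from the released code.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (in pixels) spanned by one dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def branch_plan(channels: int, kernel_size: int, dilations: tuple) -> list:
    """Describe parallel branches that share the input and split the output channels.

    Each entry is (in_channels, kernel_size, dilation, out_channels, extent).
    """
    if channels % len(dilations) != 0:
        raise ValueError("channels must divide evenly across the branches")
    out_channels = channels // len(dilations)
    return [
        (channels, kernel_size, d, out_channels, effective_kernel_size(kernel_size, d))
        for d in dilations
    ]


# Assumed configuration (illustrative only): 256 channels, 3x3 kernels, four dilation rates.
plan = branch_plan(256, 3, (1, 2, 4, 8))
for in_ch, k, d, out_ch, extent in plan:
    print(f"in={in_ch} kernel={k}x{k} dilation={d} out={out_ch} -> spans {extent}x{extent} pixels")

# Concatenating the branch outputs restores the original channel count.
total_out_channels = sum(entry[3] for entry in plan)
```

Under these assumptions the branches span 3, 5, 9, and 17 pixels, so a single block mixes local patterns with more distant context at the same parameter cost per branch.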

Paper Structure

This paper contains 26 sections, 8 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: An illustration of image inpainting. Given an image (a) (resolution: $512\times 512$) with large irregular holes, our model is able to reconstruct more plausible structures and clearer textures of the window in (d) compared with GatedConv yu2019free (b) and HiFill yi2020contextual (c). Enlarged patches, marked by red boxes, are shown next to the images.
  • Figure 2: The overview of the proposed Aggregated COntextual-Transformation GAN (AOT-GAN). AOT-GAN is built upon a generative adversarial network (GAN), which consists of a generator and a discriminator. Specifically, the generator is highly modularized by stacking multiple layers of a carefully designed block, i.e., the AOT block, to enhance context reasoning. The discriminator is trained on a tailored mask-prediction task, which aims at predicting downsampled patch-level inpainting masks. Our model is jointly optimized by a reconstruction loss, an adversarial loss goodfellow2014generative, a perceptual loss johnson2016perceptual and a style loss gatys2016image. Details can be found in Section \ref{sec:approach}.
  • Figure 3: An illustration of the Residual block (a) used in the state-of-the-art deep inpainting models nazeri2019edgeconnect, xie2019image, ren2019structureflow and the proposed AOT block (b). The numbers inside orange blocks denote (#input channels, filter sizes, dilation rates, #output channels).
  • Figure 4: An illustration of different tasks for the training of the discriminator. PatchGAN aims at distinguishing patches of inpainted images from those of real images, while HM-PatchGAN and SM-PatchGAN aim to segment synthesized patches of missing regions from real ones of contexts according to inpainting masks.
  • Figure 5: Qualitative comparisons of AOT-GAN with CA yu2018generative, PEN-Net zeng2019learning, PConv liu2018image, EdgeConnect nazeri2019edgeconnect, GatedConv yu2019free and HiFill yi2020contextual on Places2 zhou2018places. Each column shows the results and their enlarged patches marked by red boxes next to them. As shown in these cases, our model is able to reconstruct more plausible structures and clearer textures of various scenes, including valley, street, field wild and alcove. All the images are center-cropped and resized to $512\times 512$. See analysis in Section \ref{sec:qual}. [Best viewed zoomed in]
  • ...and 7 more figures
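The captions for Figures 2 and 4 describe a discriminator that predicts downsampled patch-level inpainting masks. A minimal sketch of how such a patch-level target could be derived from a pixel-level mask is shown below. It assumes that 1 marks a missing (synthesized) pixel and that patches are non-overlapping; plain average pooling is used here as a simple stand-in for whatever downsampling or smoothing the paper actually applies. `patch_level_mask` and `hard_patch_labels` are hypothetical names, not part of the released code.

```python
def patch_level_mask(mask: list, patch: int) -> list:
    """Average-pool a binary pixel mask over non-overlapping patch x patch cells.

    Each output value is the fraction of missing pixels inside that patch.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask size must be a multiple of the patch size")
    pooled = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            total = sum(
                mask[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            )
            row.append(total / (patch * patch))
        pooled.append(row)
    return pooled


def hard_patch_labels(soft: list, threshold: float = 0.5) -> list:
    """Binarize the pooled mask: 1 if the patch is mostly synthesized, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in soft]


# Toy 4x4 mask (assumed convention: 1 = missing pixel, 0 = known context).
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
soft = patch_level_mask(mask, patch=2)   # fractional targets per patch
hard = hard_patch_labels(soft)           # binary targets per patch
print(soft)
print(hard)
```

The fractional target keeps information about patches that straddle a hole boundary, whereas the binary target forces each patch to be labeled entirely real or entirely synthesized.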