High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

Libo Zhang, Yongsheng Yu, Jiali Yao, Heng Fan

TL;DR

This work tackles the gap between GAN inversion and image inpainting by introducing MMInvertFill, an encoder–generator framework that fuses multimodal priors through a novel $\mathcal{F} \& \mathcal{W}^+$ latent space and a Soft-update Mean Latent mechanism. The method combines Adaptive Contextual Bottlenecks and Gated Mask-aware Attention within a Multi-modal Mutual Decoder to produce faithful RGB outputs alongside segmentation and edge priors, guided by pre-modulation mappings. Experimental results across six datasets show substantial gains in fidelity and realism, particularly for large holes and out-of-domain images, with competitive or superior FID, LPIPS, and SSIM scores and robust cross-domain performance without retraining the generator. The approach promises practical impact for high-quality inpainting in diverse scenes and unseen domains, supported by open code and data availability.

Abstract

Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using the unmasked content. Previous GAN inversion-based methods usually use well-trained GAN models as effective priors to generate realistic regions for missing holes. Despite their excellence, they ignore the hard constraint that the unmasked regions in the input and the output should be the same, which creates a gap between GAN inversion and image inpainting and degrades performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images that could bring improvements. To address these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill primarily contains a multimodal guided encoder with a pre-modulation and a GAN generator with the F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation and edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module that captures more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In extensive experiments on six challenging datasets, we show that MMInvertFill qualitatively and quantitatively outperforms other state-of-the-art methods and effectively supports the completion of out-of-domain images.
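
The hard constraint mentioned in the abstract can be made concrete with a toy example. The sketch below is not the paper's F&W+ mechanism; it only illustrates the constraint itself, using plain mask compositing on a tiny grayscale image. The names `composite` and `known_region_error`, the nested-list image type, and the convention that a mask value of 1 marks a hole are all assumptions made for illustration.

```python
from typing import List

# Hypothetical toy type: an H x W grayscale image as nested lists.
Image = List[List[float]]


def composite(inp: Image, gen: Image, mask: Image) -> Image:
    """Take hole pixels from `gen` and known pixels from `inp`.

    Assumed convention: mask value 1 marks a missing (hole) pixel,
    mask value 0 marks a known (unmasked) pixel.
    """
    return [
        [m * g + (1 - m) * x for x, g, m in zip(row_x, row_g, row_m)]
        for row_x, row_g, row_m in zip(inp, gen, mask)
    ]


def known_region_error(inp: Image, out: Image, mask: Image) -> float:
    """Largest absolute difference between `inp` and `out` over known pixels."""
    return max(
        (
            abs(x - o)
            for row_x, row_o, row_m in zip(inp, out, mask)
            for x, o, m in zip(row_x, row_o, row_m)
            if m == 0
        ),
        default=0.0,
    )


# Toy data: the right column is the hole, the left column is known.
inp = [[0.2, 0.4], [0.6, 0.8]]
gen = [[0.25, 0.1], [0.55, 0.9]]  # raw generator output drifts on known pixels
mask = [[0, 1], [0, 1]]

out = composite(inp, gen, mask)
print(known_region_error(inp, gen, mask) > 0)  # raw output violates the constraint
print(known_region_error(inp, out, mask))      # composited output satisfies it
```

Compositing satisfies the constraint by construction, but it does nothing to make the generated hole agree in color or semantics with its surroundings, which is the issue the abstract says the F&W+ latent space is meant to mitigate.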

Paper Structure

This paper contains 29 sections, 10 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: The proposed method supports seamless high-fidelity large hole image inpainting and out-of-domain restoration.
  • Figure 2: A high-level schematic diagram illustrating the full pipeline of the proposed method.
  • Figure 3: Illustration of our MMInvertFill, including multi-modal guided encoder (image (a)), feature pyramid-based mapping networks (image (b)), mapping network with pre-modulation network (image (c)) and StyleGAN2 generator with our proposed $\mathcal{F} \& \mathcal{W}^+$ latent space (image (d)).
  • Figure 4: Illustration of Adaptive Contextual Bottleneck.
  • Figure 5: Illustration of Gated Mask-aware Attention.
  • ...and 8 more figures