Image Inpainting via Conditional Texture and Structure Dual Generation

Xiefan Guo, Hongyu Yang, Di Huang

TL;DR

The paper tackles large-hole image inpainting with a dual-generation framework that couples two subtasks: structure-constrained texture synthesis and texture-guided structure reconstruction. A Bi-directional Gated Feature Fusion (Bi-GFF) module exchanges and combines structure and texture features, a Contextual Feature Aggregation (CFA) module refines the generated content through region affinity learning and multi-scale feature aggregation, and a two-stream discriminator supervises both texture and structure. The generator uses partial convolutions and is trained with a composite loss that includes ${\mathcal{L}}_{rec}$, ${\mathcal{L}}_{perc}$, ${\mathcal{L}}_{style}$, ${\mathcal{L}}_{adv}$, and ${\mathcal{L}}_{inter}$. The method is evaluated on CelebA, Paris StreetView, and Places2, where the authors report qualitative and quantitative gains over prior methods. Ablations examine the contribution of structure priors, dual generation, and multi-scale contextual aggregation, and the code is publicly released.
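
The five loss terms named in the TL;DR are most naturally combined as a weighted sum. The form below is a sketch under that assumption: the symbol ${\mathcal{L}}_{joint}$ and the trade-off weights $\lambda$ are notation introduced here for illustration, and no weight values are given in this summary.

$$
{\mathcal{L}}_{joint} = \lambda_{rec}\,{\mathcal{L}}_{rec} + \lambda_{perc}\,{\mathcal{L}}_{perc} + \lambda_{style}\,{\mathcal{L}}_{style} + \lambda_{adv}\,{\mathcal{L}}_{adv} + \lambda_{inter}\,{\mathcal{L}}_{inter}
$$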

Abstract

Deep generative approaches have recently made considerable progress in image inpainting by introducing structure priors. Due to the lack of proper interaction with image texture during structure reconstruction, however, current solutions are incompetent in handling the cases with large corruptions, and they generally suffer from distorted results. In this paper, we propose a novel two-stream network for image inpainting, which models the structure-constrained texture synthesis and texture-guided structure reconstruction in a coupled manner so that they better leverage each other for more plausible generation. Furthermore, to enhance the global consistency, a Bi-directional Gated Feature Fusion (Bi-GFF) module is designed to exchange and combine the structure and texture information and a Contextual Feature Aggregation (CFA) module is developed to refine the generated contents by region affinity learning and multi-scale feature aggregation. Qualitative and quantitative experiments on the CelebA, Paris StreetView and Places2 datasets demonstrate the superiority of the proposed method. Our code is available at https://github.com/Xiefan-Guo/CTSDG.
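
To make the idea of exchanging structure and texture information concrete, here is a minimal NumPy sketch of one way a bi-directional gated fusion step could look. It is not the authors' implementation: the function names, the 1x1 channel mix standing in for a learned convolution, and the residual scales `alpha` and `beta` are all assumptions chosen for illustration. Each stream computes a soft gate from both feature maps, uses it to select what to borrow from the other stream, and adds the borrowed features to its own.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def soft_gate(f_own, f_other, weight):
    """Compute a gate in (0, 1) from two (C, H, W) feature maps.

    `weight` has shape (C, 2C) and stands in for a learned convolution;
    a 1x1 channel mix is used here purely for illustration (assumption).
    """
    stacked = np.concatenate([f_own, f_other], axis=0)    # (2C, H, W)
    mixed = np.einsum("oc,chw->ohw", weight, stacked)     # (C, H, W)
    return sigmoid(mixed)


def bi_gated_fusion(f_tex, f_str, w_tex, w_str, alpha=0.0, beta=0.0):
    """Fuse texture and structure features in both directions.

    alpha and beta are hypothetical residual scales: with both at zero,
    each stream passes through unchanged.
    """
    g_tex = soft_gate(f_tex, f_str, w_tex)   # what texture takes from structure
    g_str = soft_gate(f_str, f_tex, w_str)   # what structure takes from texture
    f_tex_out = f_tex + alpha * (g_tex * f_str)
    f_str_out = f_str + beta * (g_str * f_tex)
    # Concatenate the two refined streams along the channel axis.
    return np.concatenate([f_tex_out, f_str_out], axis=0)


rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
f_tex = rng.normal(size=(channels, height, width))
f_str = rng.normal(size=(channels, height, width))
w_tex = rng.normal(size=(channels, 2 * channels))
w_str = rng.normal(size=(channels, 2 * channels))

fused = bi_gated_fusion(f_tex, f_str, w_tex, w_str, alpha=1.0, beta=1.0)
print(fused.shape)
```

Because each gate lies strictly between 0 and 1, a stream never borrows more than the full magnitude of the other stream's features at any position, which is the sense in which the exchange is "gated".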

Paper Structure

This paper contains 15 sections, 16 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: High-quality inpainting results. From left to right: (a) input corrupted images, (b) our reconstructed structures, (c) our filled results, and (d) ground-truth images.
  • Figure 2: Overview of the proposed method (best viewed in color). Generator: Image inpainting is cast into two subtasks, i.e., structure-constrained texture synthesis (left, blue) and texture-guided structure reconstruction (right, red), and the two parallel-coupled streams borrow encoded deep features from each other. The Bi-directional Gated Feature Fusion (Bi-GFF) module and Contextual Feature Aggregation (CFA) module are stacked at the end of the generator to further refine the results. Discriminator: The texture branch estimates the generated texture, while the structure branch guides structure reconstruction.
  • Figure 3: Illustration of the Bi-directional Gated Feature Fusion (Bi-GFF) module, which entangles the decoded structure and texture features to refine the results.
  • Figure 4: Illustration of the Contextual Feature Aggregation (CFA) module, which models long-term spatial dependency by capturing features at diverse semantic levels.
  • Figure 5: Qualitative comparison on CelebA, Paris StreetView and Places2 (zoom in for a better view): (a) input corrupted images, (b) PatchMatch [barnes2009patchmatch], (c) PConv [liu2018image], (d) DeepFillv2 [yu2019free], (e) RFR [li2020recurrent], (f) MED [liu2020rethinking], (g) Ours, and (h) ground-truth images.
  • ...and 2 more figures