Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Abhigyan Bhattacharya, Hiranmoy Roy

TL;DR

The paper tackles facial image inpainting under large irregular masks, where preserving identity and facial structure is critical. It introduces a semantic-guided two-stage GAN: in Stage 1, a hybrid CNN-Transformer perception encoder generates probabilistic semantic layouts; in Stage 2, a Multi-Modal Texture Generator refines textures coherently. A multi-scale contextual attention mechanism and progressive WGAN-GP-based training support stable learning and high-quality synthesis. Experiments on CelebA-HQ and FFHQ demonstrate improvements in perceptual and structural metrics, with strong qualitative results on challenging masks and clear evidence of semantic preservation.
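
For orientation, the critic objective usually meant by "WGAN-GP" can be written as below. This is the standard textbook form, not a formula taken from this paper: the symbols (critic $D$, real sample $x$, generated sample $\tilde{x}$, interpolate $\hat{x}$, penalty weight $\lambda$) are the conventional ones and are assumptions here, and the paper's actual loss may combine this with further terms.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
     + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
       \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big], \\
\hat{x} &= \epsilon \, x + (1 - \epsilon)\, \tilde{x},
\qquad \epsilon \sim U[0, 1].
\end{aligned}
```

The last term penalizes the critic whenever its gradient norm at points between real and generated samples deviates from 1, which is the property generally credited with making this kind of training more stable.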

Abstract

Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task of particular importance for photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, owing to direct pixel-level synthesis and limited exploitation of facial priors. In this paper, we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level and then refines texture, which gives a clear picture of the facial structure before detailed images are generated. In the first stage, we combine two techniques, CNNs for local features and Vision Transformers for global features, to produce clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by drawing on information from different scales, so that the result is cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting scenarios.
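
Of the metrics named above, PSNR is the simplest to state: it is 10 times the base-10 logarithm of the squared peak intensity divided by the mean squared error between the reference and the restored image. The sketch below is only an illustration of that definition, not the paper's evaluation code; the function name `psnr`, the flat-list image representation, and the toy pixel values are assumptions made for this example.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities (a hypothetical
    representation chosen to keep this sketch dependency-free).
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded

    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 grayscale "images", flattened row by row (made-up values).
reference = [52, 60, 61, 70]
restored = [50, 60, 65, 70]
print(f"PSNR: {psnr(reference, restored):.2f} dB")
```

Higher PSNR means a smaller pixel-wise error. LPIPS and SSIM, by contrast, are designed to reflect perceptual and structural similarity, which is why inpainting results are commonly reported on all three.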

Paper Structure

This paper contains 24 sections, 18 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Semantic-Guided 2-Stage GAN with Hybrid Perceptual Encoding architecture.
  • Figure 2: Qualitative results comparing different ablation settings. The hybrid model with attention shows better texture consistency and structural recovery on CelebA.
  • Figure 3: Qualitative results comparing different ablation settings. The hybrid model with attention shows better texture consistency and structural recovery on FFHQ.
  • Figure 4: Graphical representation of evaluation metrics for the ablation settings on CelebA and FFHQ.
  • Figure 5: Graphical representation of FID for the different ablation settings.
  • ...and 1 more figure