DAFT-GAN: Dual Affine Transformation Generative Adversarial Network for Text-Guided Image Inpainting

Jihoon Lee, Yunhong Min, Hwidong Kim, Sangtae Ahn

TL;DR

This work tackles text-guided image inpainting, where the recovered content must stay aligned with a text description. It introduces DAFT-GAN, a framework that uses dual affine transformations to fuse text and image features in the decoder, while encoding the corrupted and uncorrupted regions of the masked image separately to minimize leakage from uncorrupted features. Trained with MA-GP adversarial losses, a reconstruction loss, and DAMSM guidance in a one-stage dual-path decoding strategy, it achieves state-of-the-art results among GAN-based methods on MS-COCO, CUB, and Oxford-102. The approach improves semantic fidelity, enables text-controlled manipulation with reduced information leakage, and offers faster inference than diffusion-based alternatives.

Abstract

In recent years, there has been a significant focus on research related to text-guided image inpainting. However, the task remains challenging due to several constraints, such as ensuring alignment between the image and the text, and maintaining consistency in distribution between corrupted and uncorrupted regions. In this paper, we therefore propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency in text-guided inpainting. DAFT-GAN integrates two affine transformation networks to combine text and image features gradually in each decoding block. Moreover, we minimize information leakage of uncorrupted features for fine-grained image generation by encoding corrupted and uncorrupted regions of the masked image separately. Our proposed model outperforms existing GAN-based models in both qualitative and quantitative assessments on three benchmark datasets (MS-COCO, CUB, and Oxford) for text-guided image inpainting.

Paper Structure

This paper contains 21 sections, 17 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Results of the proposed DAFT-GAN. Masked (left), generated (middle), and ground-truth (right) images are presented for three datasets (MS-COCO, CUB, and Oxford).
  • Figure 2: Architecture of DAFT-GAN consisting of an encoder-decoder generator and a one-way discriminator. The generator extracts image features and combines them with noise and text embeddings to generate the reconstructed images.
  • Figure 3: Visualization of the SMC block. Encoding includes convolution and normalization, generating higher-dimensional features that are downscaled by a factor of 2.
  • Figure 4: Structure of the DAFT block. The block is composed of RAT and MCAT modules, which handle the global path and the spatial path, respectively, forming a dual-path architecture.
  • Figure 5: Diagram of the CrossAffine module. The module modulates input image features with recurrent hidden features and word features through a channel-wise affine transformation.
  • ...and 4 more figures
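
The channel-wise affine transformation mentioned in the Figure 5 caption can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`linear`, `channel_affine`, `text_conditioned_affine`), the use of single linear layers to predict the scale and shift, and the toy weights are all assumptions made for illustration. It only shows the general idea of predicting one scale (gamma) and one shift (beta) per channel from a conditioning vector, such as a text embedding, and applying them to an image feature map.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]  # layout: [channels][height][width]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Fully connected layer: returns weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def channel_affine(features: FeatureMap, gamma: Vector, beta: Vector) -> FeatureMap:
    """Apply out[c][h][w] = gamma[c] * x[c][h][w] + beta[c] to every channel c."""
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def text_conditioned_affine(features: FeatureMap, text_vec: Vector,
                            w_gamma: List[Vector], b_gamma: Vector,
                            w_beta: List[Vector], b_beta: Vector) -> FeatureMap:
    """Predict per-channel scale and shift from the text vector, then modulate.

    Hypothetical helper: one linear layer per parameter is an assumption,
    not a detail taken from the paper.
    """
    gamma = linear(w_gamma, b_gamma, text_vec)
    beta = linear(w_beta, b_beta, text_vec)
    return channel_affine(features, gamma, beta)


# Toy input: 2 channels, 2x2 spatial size, 3-dimensional text vector.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
text_vec = [1.0, 0.0, -1.0]

# Made-up weights; a trained model would learn these.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, -2.0]]
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
b_beta = [0.0, 1.0]

out = text_conditioned_affine(features, text_vec, w_gamma, b_gamma, w_beta, b_beta)
print(out)
```

With these toy values the predicted parameters are gamma = [1.0, 2.0] and beta = [-1.0, 1.0], so the first channel is shifted down by one and the second is doubled and shifted up by one. A dual-path design like the one in the Figure 4 caption would presumably apply such modulation with different conditioning signals on each path, but the details of that combination are not given in this summary.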