DAFT-GAN: Dual Affine Transformation Generative Adversarial Network for Text-Guided Image Inpainting
Jihoon Lee, Yunhong Min, Hwidong Kim, Sangtae Ahn
TL;DR
This work tackles text-guided image inpainting, where the recovered content must stay aligned with a text description. It introduces DAFT-GAN, a framework that uses dual affine transformations to fuse text and image features in the decoder. A separate mask pathway minimizes information leakage from uncorrupted regions. The model combines MA-GP adversarial losses, reconstruction and DAMSM-based guidance, and a one-stage dual-path decoding strategy. With this design it achieves state-of-the-art results among GAN-based methods on MS-COCO, CUB, and Oxford-102. The approach improves semantic fidelity and efficiency, enabling reliable text-controlled manipulation with reduced information leakage and faster inference than diffusion-based alternatives.
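To make "dual affine transformations that fuse text and image features" concrete, here is a minimal sketch in the spirit of text-conditioned affine fusion layers (as popularized by DF-GAN), not the authors' implementation. Each affine stage predicts a per-channel scale (gamma) and shift (beta) from a conditioning vector and applies them to the feature maps. Several details are assumptions: the names `AffineTransform` and `dual_affine_block`, the random linear heads, the LeakyReLU between stages, and the choice of what each conditioning vector contains. The summary above does not specify what conditions each of the two stages.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output unit


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Dense layer y = W x + b, written with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x >= 0 else slope * x


class AffineTransform:
    """Hypothetical conditioned affine layer: features -> gamma * features + beta,
    where gamma and beta (one value per channel) are predicted from `cond`."""

    def __init__(self, cond_dim: int, channels: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_gamma = [[rng.gauss(0.0, 0.1) for _ in range(cond_dim)] for _ in range(channels)]
        self.b_gamma = [1.0] * channels  # bias of 1 so a zero condition gives identity scaling
        self.w_beta = [[rng.gauss(0.0, 0.1) for _ in range(cond_dim)] for _ in range(channels)]
        self.b_beta = [0.0] * channels   # bias of 0 so a zero condition gives no shift

    def __call__(self, features: List[Vector], cond: Vector) -> List[Vector]:
        # `features` holds one flattened spatial map per channel.
        gamma = linear(self.w_gamma, self.b_gamma, cond)
        beta = linear(self.w_beta, self.b_beta, cond)
        return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


def dual_affine_block(features: List[Vector], cond_a: Vector, cond_b: Vector,
                      affine_a: AffineTransform, affine_b: AffineTransform) -> List[Vector]:
    """Two conditioned affine stages, each followed by LeakyReLU (assumed ordering)."""
    h = affine_a(features, cond_a)
    h = [[leaky_relu(v) for v in ch] for ch in h]
    h = affine_b(h, cond_b)
    return [[leaky_relu(v) for v in ch] for ch in h]
```

With an all-zero conditioning vector, both stages reduce to the identity followed by LeakyReLU. This makes the block easy to sanity-check. A real decoder would use learned convolutional or MLP heads and apply such a block at every upsampling stage.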
Abstract
In recent years, text-guided image inpainting has attracted significant research attention. However, the task remains challenging due to several constraints, such as ensuring alignment between the image and the text and keeping the distribution consistent between corrupted and uncorrupted regions. In this paper, we therefore propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency in text-guided inpainting. DAFT-GAN integrates two affine transformation networks that gradually combine text and image features in each decoding block. We also encode the corrupted and uncorrupted regions of the masked image separately. This minimizes information leakage from uncorrupted features and enables fine-grained image generation. Our proposed model outperforms existing GAN-based models in both qualitative and quantitative assessments on three benchmark datasets for text-guided image inpainting (MS-COCO, CUB, and Oxford-102).
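The idea of encoding corrupted and uncorrupted regions separately can be illustrated with a small, hypothetical sketch. The masked image is split into two streams that would feed two separate encoders. One stream carries only the known pixels, with holes zeroed. The other carries only the hole locations and never any pixel values. The names `split_streams`, `avg_pool2`, and `composite` are illustrative. The 2x2 average pooling is a toy stand-in for a real encoder. The final compositing step is a common inpainting practice assumed here, not confirmed by the abstract. The abstract also does not say exactly what each DAFT-GAN encoder receives.

```python
from typing import List, Tuple

Image = List[List[float]]  # single-channel H x W grid of pixel values
Mask = List[List[int]]     # 1 marks a corrupted (missing) pixel, 0 a known one


def split_streams(image: Image, mask: Mask) -> Tuple[Image, Image]:
    """Return (known_stream, hole_stream) for two separate encoders.

    known_stream keeps only uncorrupted pixels (holes set to zero);
    hole_stream marks where pixels are missing and contains no pixel values.
    """
    known = [[p * (1 - m) for p, m in zip(prow, mrow)] for prow, mrow in zip(image, mask)]
    holes = [[float(m) for m in mrow] for mrow in mask]
    return known, holes


def avg_pool2(x: Image) -> Image:
    """Toy stand-in encoder: 2x2 average pooling (assumes even height and width)."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def composite(generated: Image, image: Image, mask: Mask) -> Image:
    """Paste generated pixels into the holes only; known pixels pass through unchanged."""
    return [[g if m else p for g, p, m in zip(grow, prow, mrow)]
            for grow, prow, mrow in zip(generated, image, mask)]
```

Keeping the two streams apart means the hole encoder never sees known pixel values. Compositing guarantees that the known pixels in the output are exactly the input pixels.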
