Text-Adaptive Generative Adversarial Networks: Manipulating Images with Natural Language

Seonghyeon Nam, Yunji Kim, Seon Joo Kim

TL;DR

The paper addresses manipulating images with natural language by introducing TAGAN, a GAN framework that uses a text-adaptive discriminator composed of word-level discriminators to provide fine-grained feedback for modifying only described attributes while preserving text-irrelevant content. The generator encodes the input image and text, applying a reconstruction loss to keep non-described content intact, while the discriminator's word-level attention guides targeted attribute changes. TAGAN demonstrates superiority over prior methods on CUB and Oxford-102 through both quantitative and qualitative evaluations, including human judgments that favor its outputs and analyses showing effective attribute disentanglement and content preservation. The approach advances multi-modal image editing by enabling region-specific, language-guided manipulations without discarding background structure or unrelated content.

Abstract

This paper addresses the problem of manipulating images using natural language descriptions. Our task aims to semantically modify visual attributes of an object in an image according to the text describing the new visual appearance. Although existing methods synthesize images having new attributes, they do not fully preserve text-irrelevant contents of the original image. In this paper, we propose the text-adaptive generative adversarial network (TAGAN) to generate semantically manipulated images while preserving text-irrelevant contents. The key to our method is the text-adaptive discriminator that creates word-level local discriminators according to input text to classify fine-grained attributes independently. With this discriminator, the generator learns to generate images where only regions that correspond to the given text are modified. Experimental results show that our method outperforms existing methods on CUB and Oxford-102 datasets, and our results were mostly preferred in a user study. Extensive analysis shows that our method is able to effectively disentangle visual attributes and produce pleasing outputs.
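
The idea of word-level local discriminators can be illustrated with a small sketch. This is not the authors' implementation. It assumes that each word vector acts directly as the weights of its own logistic classifier over an image feature vector, and that the per-word scores are combined as an attention-weighted geometric mean. The function names (`local_score`, `text_adaptive_score`), the toy vectors, and the aggregation rule are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    # Numerically stable softmax; the outputs sum to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_score(word_vec, image_feat):
    """Hypothetical word-level local discriminator.

    The word vector is used directly as classifier weights (an assumption
    made for brevity). The result is read as the probability that the
    image shows the attribute named by the word.
    """
    logit = sum(w * v for w, v in zip(word_vec, image_feat))
    return sigmoid(logit)


def text_adaptive_score(word_vecs, attn_logits, image_feat):
    """Combine per-word scores with word attention (assumed rule).

    A weighted geometric mean is used, so one badly matched word pulls
    the overall score down even when the other words match.
    """
    alphas = softmax(attn_logits)
    log_score = sum(
        a * math.log(local_score(w, image_feat))
        for a, w in zip(alphas, word_vecs)
    )
    return math.exp(log_score)


# Toy example: two "words" and a 3-dimensional image feature.
words = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
matching = [1.5, 1.5, 0.0]        # agrees with both words
half_matching = [1.5, -1.5, 0.0]  # contradicts the second word

score_match = text_adaptive_score(words, [0.0, 0.0], matching)
score_half = text_adaptive_score(words, [0.0, 0.0], half_matching)
```

In this toy setup the image that contradicts one word scores lower than the image that matches both. This kind of per-attribute signal is what the abstract credits for letting the generator change only the described regions.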

Paper Structure

This paper contains 14 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Examples of image manipulation using natural language descriptions. Existing methods produce reasonable results, but fail to preserve text-irrelevant contents such as the background of the original image. In comparison, our method accurately manipulates images according to the text while preserving text-irrelevant contents.
  • Figure 2: The proposed GAN structure. (a) shows the overall GAN architecture and (b) depicts our text-adaptive discriminator. In (b), the attention and the layer-wise weight are omitted for simplicity.
  • Figure 3: Qualitative results of our method on CUB and Oxford-102 datasets.
  • Figure 4: Qualitative comparison of three methods. In most cases, our method outperforms baseline methods qualitatively. The rightmost column shows a failure case using our method.
  • Figure 5: Visualization of the text-adaptive discriminator. From top to bottom, the top-3 word attentions are shown. From left to right, the saliency maps of 3 layer-wise local discriminators are visualized. Each fractional number is $\beta_{ij}$. Note that $\sum_j{\beta_{ij}}=1$.
  • ...and 2 more figures
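
The constraint $\sum_j{\beta_{ij}}=1$ in the Figure 5 caption is the property a softmax over layers would give. The sketch below assumes, without confirmation from this page, that the layer-wise weights come from a softmax over per-layer logits and that a word's score is the $\beta$-weighted sum of its per-layer local discriminator outputs. The helper names and all numeric values are hypothetical.

```python
import math


def layer_weights(logits):
    """Softmax over layers: returns beta_i1..beta_iJ, which sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_score(layer_scores, logits):
    """Beta-weighted sum of the per-layer scores for one word."""
    betas = layer_weights(logits)
    return sum(b * s for b, s in zip(betas, layer_scores))


# Hypothetical values for one word i and 3 feature layers j.
betas = layer_weights([0.2, 1.0, -0.5])
score = word_score([0.9, 0.6, 0.3], [0.2, 1.0, -0.5])
```

Because the weights are non-negative and sum to 1, the combined score is a convex combination and always lies between the smallest and largest per-layer score.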