Learning What and Where to Draw

Scott Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, Honglak Lee

TL;DR

The paper introduces GAWWN, a generative framework for text- and location-conditioned image synthesis that models content ("what") and spatial constraints ("where") separately. It supports two conditioning modalities, bounding boxes and per-part keypoints, and adds a conditional keypoint generation mechanism that samples plausible part locations given the text and any subset of already-specified keypoints. On Caltech-UCSD Birds, GAWWN generates 128×128 images conditioned on text and spatial constraints, and preliminary results on MPII Human Pose suggest the approach carries over to images of human actions. The work argues that decomposing what and where improves realism and offers a versatile interface for controlled generation, with weakly supervised localization learning left as future work.

Abstract

Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations. We also show preliminary results on the more challenging domain of text- and location-controllable synthesis of images of human actions on the MPII Human Pose dataset.

Paper Structure

This paper contains 17 sections, 4 equations, 9 figures.

Figures (9)

  • Figure 1: Text-to-image examples. Locations can be specified by keypoint or bounding box.
  • Figure 2: GAWWN with bounding box location control.
  • Figure 3: Text and keypoint-conditional GAWWN. Keypoint grids are shown as $4 \times 4$ for clarity of presentation, but in our experiments we used $16 \times 16$.
  • Figure 4: Controlling the bird's position using bounding box coordinates and previously unseen text.
  • Figure 5: Bird generation conditioned on fixed groundtruth keypoints (overlaid in blue) and previously unseen text. Each sample uses a different random noise vector.
  • ...and 4 more figures
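
The Figure 3 caption mentions that keypoints are placed on a $16 \times 16$ grid. A minimal sketch of one way such a grid could be built is shown here, assuming one binary channel per part and keypoint coordinates normalized to [0, 1]. The function name `keypoints_to_grid`, the dictionary input format, and the normalization are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def keypoints_to_grid(
    keypoints: Dict[int, Tuple[float, float]],
    num_parts: int,
    grid_size: int = 16,
) -> List[List[List[float]]]:
    """Encode part keypoints as a [num_parts][grid_size][grid_size] binary grid.

    Assumptions (hypothetical, for illustration only):
      - keypoints maps a part index to (x, y), each normalized to [0, 1];
      - parts absent from the dict are left as all-zero channels, which is
        one simple way to express conditioning on a subset of parts.
    """
    grid = [
        [[0.0] * grid_size for _ in range(grid_size)]
        for _ in range(num_parts)
    ]
    for part, (x, y) in keypoints.items():
        if not 0 <= part < num_parts:
            raise ValueError(f"part index {part} out of range")
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("keypoint coordinates must lie in [0, 1]")
        # Map the continuous coordinate to a cell; clamp x == 1.0 or
        # y == 1.0 into the last cell instead of falling off the grid.
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        grid[part][row][col] = 1.0
    return grid


# Example: specify only two of three hypothetical parts.
example = keypoints_to_grid({0: (0.5, 0.25), 2: (1.0, 0.0)}, num_parts=3)
```

In this sketch an unspecified part simply contributes an empty channel, so the same representation covers both fully and partially specified poses.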