Semantic Object Accuracy for Generative Text-to-Image Synthesis

Tobias Hinz, Stefan Heinrich, Stefan Wermter

TL;DR

This work tackles the difficulty of generating complex, multi-object scenes conditioned on captions by introducing OP-GAN, which injects explicit object-centric conditioning via dedicated object pathways in both the generator and discriminators. It also introduces Semantic Object Accuracy (SOA), an evaluation metric that uses a pre-trained object detector to verify that caption-mentioned objects appear in the generated images, providing both class- and image-level scores and improving diagnostic insight beyond traditional metrics. Across MS-COCO experiments, OP-GAN variants outperform baselines on standard metrics and especially on SOA, with human studies confirming the SOA ranking. The results demonstrate the value of explicit object modeling for text-to-image synthesis and propose SOA as a practical, caption-aware tool for evaluating and guiding future models.

Abstract

Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality but not the conformity between the image and its caption. To address these challenges we introduce a new model that explicitly models individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. The SOA uses a pre-trained object detector to evaluate if a generated image contains objects that are mentioned in the image caption, e.g. whether an image generated from "a car driving down the street" contains a car. We perform a user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans, whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model objects outperform models which only model global image characteristics.
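The SOA idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the class-level score averages the per-class detection rates and that the image-level score is the detection rate over all generated images. The names `soa_scores` and `detect` are hypothetical; `detect` stands in for any pre-trained object detector that returns the set of class labels it finds in an image.

```python
from collections import defaultdict
from typing import Callable, Iterable, Set, Tuple, TypeVar

Image = TypeVar("Image")


def soa_scores(
    samples: Iterable[Tuple[str, Image]],
    detect: Callable[[Image], Set[str]],
) -> Tuple[float, float]:
    """Return (class-level score, image-level score).

    samples: pairs of (object class named in the caption, generated image).
    detect:  hypothetical stand-in for a pre-trained object detector; it
             returns the set of class labels found in the image.
    """
    hits = defaultdict(int)    # images per class where the object was detected
    totals = defaultdict(int)  # images generated per class

    for label, image in samples:
        totals[label] += 1
        if label in detect(image):
            hits[label] += 1

    if not totals:
        raise ValueError("no samples given")

    # Assumed class-level score: mean of the per-class detection rates.
    soa_class = sum(hits[c] / totals[c] for c in totals) / len(totals)
    # Assumed image-level score: detection rate over all images.
    soa_image = sum(hits.values()) / sum(totals.values())
    return soa_class, soa_image
```

For the abstract's example, a sample would pair the label "car" with an image generated from "a car driving down the street"; the image counts as a hit only if the detector reports a car. Averaging per class keeps frequent classes from dominating the class-level score, while the image-level score weights every generated image equally.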

Paper Structure

This paper contains 7 sections, 14 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of our model architecture called OP-GAN. The top row shows a high-level summary of our architecture, while the bottom two rows show details of the individual generators and discriminators.
  • Figure 2: Examples when IS fails for COCO images. The top row shows images for which the Inception-Net has very high entropy in its output layer, possibly because the images contain more than one object and are often not centered. The second row shows images containing different objects and scenes which were nonetheless all assigned to the same class by the Inception-Net, thereby negatively affecting the overall predicted diversity in the images.
  • Figure 3: Examples when R-precision fails for COCO images. The top row shows images from the COCO data set. The middle row shows the correct caption and the bottom row gives examples for characteristics of captions that are rated as being more similar than the original caption.
  • Figure 4: Comparison of images generated by different variations of our models.
  • Figure 5: Comparison of SOA scores: SOA per class with degree of a bin reflecting relative frequency of that class.
  • ...and 4 more figures