TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation

Paul Grimal, Hervé Le Borgne, Olivier Ferret, Julien Tourille

TL;DR

TIAM introduces a prompt-template–based metric to quantify how closely text prompts are realized in text-to-image generation, addressing seed-driven variability and alignment failures such as catastrophic neglect and attribute binding. The framework defines TIAM as an expectation over Gaussian seeds and template latent concepts, using a detector-based 0/1 score to judge content fidelity, and validates it across multiple diffusion models. Empirical results show alignment degrades as the number of objects grows, that seed choice significantly affects outcomes, and that TIAM correlates more strongly with human judgments than CLIP/BLIP. The work highlights seed mining as a potential lever to improve generation quality and suggests combining TIAM with other quality metrics for comprehensive evaluation.
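The definition summarized above can be written compactly. The following is only a sketch under assumed notation: the symbols $\mathcal{X}$, $G$, $y$, and $f$ are chosen here for illustration and may differ from those used in the paper's own equations.

$$
\mathrm{TIAM} = \mathbb{E}_{x \sim \mathcal{X},\; z \sim \mathcal{N}(0, I)} \left[ f\big(G(x, z),\, y(x)\big) \right]
$$

Here $x$ is a prompt obtained by filling the template, $z$ is the initial Gaussian noise (the seed), $G$ is the T2I model, $y(x)$ is the set of labels requested by $x$, and $f \in \{0, 1\}$ equals 1 only when the detector finds all requested content in the image. In practice the expectation would be estimated by averaging $f$ over all generated prompts and over the $n$ seeds drawn for each prompt.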

Abstract

Progress in synthetic image generation has made it crucial to assess the quality of generated images. While several metrics have been proposed to assess image rendering, Text-to-Image (T2I) models, which generate images from a prompt, call for additional criteria, such as the extent to which the generated image matches the important content of the prompt. Moreover, although generated images usually result from a random starting point, its influence is generally not considered. In this article, we propose a new metric based on prompt templates to study the alignment between the content specified in the prompt and the corresponding generated images. It allows us to better characterize the alignment in terms of the type of the specified objects, their number, and their color. We study several recent T2I models along these aspects. An additional result of our approach is that image quality can vary drastically depending on the noise used as a seed for the images. We also quantify the influence of the number of concepts in the prompt, their order, and their (color) attributes. Finally, our method allows us to identify seeds that produce better images than others, opening new research directions on this understudied topic.

Paper Structure (41 sections, 5 equations, 45 figures, 8 tables)

Figures (45)

  • Figure 1: Images generated with Stable Diffusion v1.4 from the prompts "a photo of a lion and a bear", "a photo of a blue cat and a yellow car", and "a photo of a red bus driving down the street". (a) The bear is missing, (b) the attributes are swapped, (c) the bus color (red) leaks onto the wall.
  • Figure 2: Overview of the evaluation pipeline. (1) Generate a dataset of prompts. (2) Generate $n \ge 16$ images per prompt. (3) Detect if the requested labels are present in the images. (4) Compute TIAM. In this example, we do not define attributes.
  • Figure 3: TIAM aggregated per seed for 64 seeds. Some starting noises tend not to yield an image containing both entities, regardless of which entities are requested. "+" marks the mean.
  • Figure 4: TIAM with 1 to 4 objects per prompt.
  • Figure 5: The proportion of occurrences of each object, based on its position in the prompt (template with 4 objects).
  • ...and 40 more figures
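
The four pipeline steps of Figure 2, together with the per-seed aggregation of Figure 3, can be illustrated with a small sketch. This is not the authors' implementation: `build_prompts`, `score`, `tiam`, `toy_generate`, and `toy_detect` are hypothetical names introduced here, the prompt wording simply follows the Figure 1 examples, and the toy generator and detector stand in for a real T2I model and object detector.

```python
from itertools import permutations
from statistics import mean


def build_prompts(labels, k):
    """Step 1: one prompt per ordered k-tuple of distinct labels."""
    prompts = []
    for combo in permutations(labels, k):
        # Simplification: always uses the article "a".
        text = "a photo of " + " and ".join(f"a {name}" for name in combo)
        prompts.append((text, combo))
    return prompts


def score(detected, requested):
    """0/1 score: 1 only if every requested label was detected."""
    return 1 if set(requested) <= set(detected) else 0


def tiam(prompts, seeds, generate, detect):
    """Steps 2-4: generate per seed, detect, then average the 0/1 scores."""
    per_seed = {seed: [] for seed in seeds}
    for text, requested in prompts:
        for seed in seeds:
            image = generate(text, seed)
            per_seed[seed].append(score(detect(image), requested))
    all_scores = [s for scores in per_seed.values() for s in scores]
    return mean(all_scores), {seed: mean(v) for seed, v in per_seed.items()}


# Toy stand-ins (hypothetical): no real model or detector is called.
def toy_generate(text, seed):
    parts = text[len("a photo of "):].split(" and ")
    names = [part.split(" ", 1)[1] for part in parts]
    # Pretend that odd seeds drop every object after the first one.
    return names if seed % 2 == 0 else names[:1]


def toy_detect(image):
    # The toy "image" is already the list of labels it contains.
    return image


prompts = build_prompts(["lion", "bear", "cat"], k=2)
overall, by_seed = tiam(prompts, range(4), toy_generate, toy_detect)
print(overall, by_seed)
```

In this toy run every even seed scores 1 and every odd seed scores 0, so the overall value hides a large per-seed spread. Returning the per-seed means alongside the global average is a design choice made here to mirror the kind of seed-level view shown in Figure 3.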
