Towards Geographic Inclusion in the Evaluation of Text-to-Image Models

Melissa Hall, Samuel J. Bell, Candace Ross, Adina Williams, Michal Drozdzal, Adriana Romero Soriano

TL;DR

The paper tackles the problem that automatic metrics for text-to-image evaluation may fail to capture the full diversity of human preferences across geographic regions. It introduces a large cross-cultural study across Africa, Europe, and Southeast Asia, collecting over 65,000 annotations to compare human judgments on geographic representation, visual appeal, and object consistency with automated metrics across real GeoDE images and generated images from two public models. Key findings show substantial regional variation in perceptions, limited alignment of Region-CLIPScore with human judgments, and that modern feature extractors (CLIP, DINOv2) better reflect human similarity judgments than traditional Inception-based metrics. The authors propose practical recommendations for more inclusive human and automatic evaluations, including multi-region annotator input, careful selection of reference datasets, and clearer reporting of evaluation assumptions, aiming to improve the fairness and usefulness of T2I model assessments.

Abstract

Rapid progress in text-to-image generative models coupled with their deployment for visual content creation has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly used metrics often fail to account for the full diversity of human preference; even in-depth human evaluations often face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study of how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the-art public APIs. We collect over 65,000 image annotations and 20 survey responses. We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative. In addition, the utility of automatic evaluations is dependent on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity or the definition of "appeal" captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.

Paper Structure (36 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: We collect 65k annotations performed by people located in Africa, Europe, and Southeast Asia corresponding to evaluation criteria of text-to-image models including geographic representation, similarity, visual appeal, and object consistency in real and generated images. We develop recommendations for improved human and automatic evaluations of text-to-image models.
  • Figure 2: Random examples of real images from GeoDE in each region and generated images from DM w/ CLIP and LDM 2.1 using the prompt {object} in {region}. The first two columns correspond to Africa, the next to Europe, and the last to Southeast Asia.
  • Figure 3: (a) Proportion of images that in-region and out-of-region annotators consider geographically representative, for objects and backgrounds. (b) Relationship between Region-CLIPScore and annotator designation of geographic representation. The x-axis shows the average CLIPScore for all images within each bucket of width 0.01. The y-axis shows the proportion of images where annotators said the object was present. We include 95% confidence intervals generated via bootstrapping. Annotator perceptions of geographic representation differ according to whether the annotators are located in the region of focus or outside it. Region-CLIPScore does not always capture variations in perceived geographic representation across annotator location.
  • Figure 4: Examples of disagreement between in- and out-of-region annotators about geographic representation of objects and backgrounds. Out-of-region annotators tend to consider stereotypes representative more than in-region annotators.
  • Figure 5: (a-b) Qualitative examples of consistent and inconsistent annotator perceptions of similarity, where (a) depicts (from left): reference image from GeoDE and comparison images from DM w/ CLIP designated as more and less similar by all three annotators and (b) depicts (from left): reference image from GeoDE and two comparison images from DM w/ CLIP with inconsistent similarity annotations. (c) Rates of annotator agreement in similarity. Variations in perception of object similarity can depend on the diversity of the images. Factors in perception of similarity include color, size, type (e.g., dog breed), and camera angle.
  • ...and 7 more figures
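
The analysis described in the Figure 3(b) caption (grouping images into score buckets of width 0.01, plotting the proportion of positive annotations per bucket, and attaching 95% bootstrapped confidence intervals) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `bucket_proportions` and `bootstrap_ci` are hypothetical, the annotations are assumed to be binary (1 = positive, 0 = negative), and a simple percentile bootstrap is assumed because the caption only says "bootstrapping".

```python
import math
import random
from collections import defaultdict


def bootstrap_ci(labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of binary labels.

    Assumption: a plain percentile bootstrap; the source does not specify
    which bootstrap variant was used.
    """
    rng = random.Random(seed)
    n = len(labels)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def bucket_proportions(scores, labels, width=0.01):
    """Group images into fixed-width score buckets and summarise each bucket.

    Returns {bucket_index: {"mean_score", "proportion", "ci"}}, where
    "mean_score" is the x-value and "proportion" the y-value for that bucket.
    """
    buckets = defaultdict(list)
    for score, label in zip(scores, labels):
        # Small epsilon guards against float error at exact bucket boundaries.
        index = int(math.floor(score / width + 1e-9))
        buckets[index].append((score, label))

    summary = {}
    for index, items in sorted(buckets.items()):
        bucket_scores = [s for s, _ in items]
        bucket_labels = [y for _, y in items]
        summary[index] = {
            "mean_score": sum(bucket_scores) / len(bucket_scores),
            "proportion": sum(bucket_labels) / len(bucket_labels),
            "ci": bootstrap_ci(bucket_labels),
        }
    return summary


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    toy_scores = [0.251, 0.254, 0.262, 0.268]
    toy_labels = [1, 0, 1, 1]
    for index, stats in bucket_proportions(toy_scores, toy_labels).items():
        print(index, stats)
```

Running the same summary separately for in-region and out-of-region annotators would give the two curves that such a plot compares.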