Decomposed evaluations of geographic disparities in text-to-image models

Abhishek Sureddy, Dishant Padalia, Nandhinee Periyakaruppa, Oindrila Saha, Adina Williams, Adriana Romero-Soriano, Megan Richards, Polina Kirichenko, Melissa Hall

TL;DR

This work tackles geographic disparities in text-to-image generation by disentangling the object and background components that contribute to bias. It introduces Decomposed-DIG, which extends the DIG indicators with object-background segmentation using LangSAM and ViT-based features to yield Obj-only and BG-only benchmarks. Empirically, objects are generally more realistic than backgrounds, and regional disparities are larger for backgrounds than for objects, with Africa and Europe illustrating key failure modes such as missing red sedans and outdoor placements of indoor items. A prompting mitigation using regional adjectives boosts background diversity (up to 52% in the worst region and 20% on average) while preserving object realism, demonstrating the utility of fine-grained evaluation for informing practical fixes. Overall, Decomposed-DIG provides a precise, actionable framework for diagnosing and reducing geographic biases in text-to-image systems.

Abstract

Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these disparities to specific parts of the generated images. In this work, we introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to separately measure geographic disparities in the depiction of objects and backgrounds in generated images. Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds and that backgrounds in generated images tend to contain larger regional disparities than objects. We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation in Africa, difficulty generating modern vehicles in Africa, and unrealistic placement of some objects in outdoor settings. Informed by our metric, we use a new prompting structure that enables a 52% worst-region improvement and a 20% average improvement in generated background diversity.
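
The two prompt structures compared in this work, {object} in {region} and {regional adjective} {object}, can be sketched as simple string templates. This is a minimal illustration only: the function names and the adjective lookup table are hypothetical, and of the adjectives below only "European" appears in the source text.

```python
# Hypothetical lookup from region name to regional adjective.
# Only "European" is taken from the source; "African" is an assumed entry.
REGIONAL_ADJECTIVES = {
    "Europe": "European",
    "Africa": "African",
}


def baseline_prompt(obj: str, region: str) -> str:
    """Original template: '{object} in {region}'."""
    return f"{obj} in {region}"


def adjective_prompt(obj: str, region: str) -> str:
    """Alternative template: '{regional adjective} {object}'."""
    return f"{REGIONAL_ADJECTIVES[region]} {obj}"


if __name__ == "__main__":
    # Print both prompt variants for each region in the lookup table.
    for region in REGIONAL_ADJECTIVES:
        print(baseline_prompt("bag", region), "|", adjective_prompt("bag", region))
```

The second template puts the object last and mentions the region only as a modifier, which is the sense in which it emphasizes the object more than the region.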

Paper Structure

This paper contains 37 sections, 2 equations, 14 figures, and 1 table.

Figures (14)

  • Figure 1: We introduce Decomposed-DIG, which decomposes measurements of geographic disparities in text-to-image generation between object and background representations. Using Decomposed-DIG, we identify generation patterns that contribute to geographic disparities.
  • Figure 2: Examples of real and generated images of objects in different regions (Europe in the top row and Africa in the bottom row). We propose Decomposed-DIG to pinpoint geographic disparities related to the depiction of objects and backgrounds in generated images created with the prompt {object} in {region}. We then study an alternative prompt template that emphasizes the object more than the region: {regional adjective} {object}, e.g., "European bag", which leads to higher background diversity. Red outlines show object/background decompositions.
  • Figure 3: In generated images, objects tend to have better realism (precision) than backgrounds, while representation diversity (coverage) is similar on average between objects and backgrounds. Shown are precision and coverage measurements averaged over all regions for Obj-only and BG-only set-ups.
  • Figure 4: Backgrounds in generated images have larger disparities in realism and diversity between geographic regions than objects. We observe larger variance in precision and coverage values for the BG-only set-up than for the Obj-only set-up.
  • Figure 5: When generating objects in Africa, the LDM struggles to depict full background diversity, in particular backgrounds with buildings and paved streets (left) and neutral indoor scenes (right). Depicted are examples of real images where there are no generated images in the hypersphere of nearest neighbors for BG-only, but there are generated images in the hypersphere for Full-image and Obj-only (shown). Red outlines show object/background decompositions.
  • ...and 9 more figures
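
The captions above refer to precision (realism), coverage (diversity), and "the hypersphere of nearest neighbors". As a rough sketch, assuming the common k-nearest-neighbor formulation of these two metrics (the exact definitions are not given in this excerpt), the computation for one set-up might look like the following. The function names, the default value of k, and the random placeholder features are illustrative assumptions; in the paper the features would come from a ViT-based extractor applied to full images, object-only regions, or background-only regions.

```python
import numpy as np


def pairwise_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_radii(real: np.ndarray, k: int) -> np.ndarray:
    """Distance from each real feature to its k-th nearest real neighbor."""
    d = pairwise_distances(real, real)
    # After sorting each row, index 0 is the point itself (distance 0),
    # so index k holds the distance to the k-th nearest other point.
    return np.sort(d, axis=1)[:, k]


def precision_and_coverage(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Assumed k-NN precision and coverage for one set-up and one region."""
    radii = knn_radii(real, k)                # shape (n_real,)
    d = pairwise_distances(real, fake)        # shape (n_real, n_fake)
    inside = d <= radii[:, None]              # inside[i, j]: fake j lies in real i's hypersphere
    precision = inside.any(axis=0).mean()     # generated samples inside at least one real hypersphere
    coverage = inside.any(axis=1).mean()      # real samples with at least one generated sample nearby
    return float(precision), float(coverage)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features standing in for the three set-ups named in the captions.
    for setup in ("Full-image", "Obj-only", "BG-only"):
        real = rng.normal(size=(200, 16))
        fake = rng.normal(size=(200, 16))
        print(setup, precision_and_coverage(real, fake))
```

Under this formulation, a real image with no generated image inside its hypersphere for BG-only features, but with generated images inside it for Full-image and Obj-only features, is the kind of case shown in Figure 5.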