Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Reyhane Askari Hemmat, Melissa Hall, Alicia Sun, Candace Ross, Michal Drozdzal, Adriana Romero-Soriano

TL;DR

This work addresses geographic biases in text-to-image generation by introducing Contextualized Vendi Score Guidance (c-VSG), an inference-time intervention for latent diffusion models. By coupling a memory bank of past generations with a set of real exemplar images, c-VSG steers sampling toward diverse yet region-grounded outputs, using a differentiable Vendi Score as the guidance signal. On GeoDE and DollarStreet, c-VSG substantially improves both worst-region and average diversity (F1, recall) while preserving or improving image quality (precision) and text-image consistency (CLIPScore), and it reduces disparities between regions. The approach is supported by ablations, an analysis of computational trade-offs, and qualitative analyses, suggesting that it can help generated imagery better reflect the true geographic diversity of the world. The authors also note limitations and point to future human evaluations.

Abstract

With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference-time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backward steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst-performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that the diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.
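
For readers unfamiliar with the Vendi Score that c-VSG builds on, the following is a minimal sketch of the standard definition: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix. The cosine-similarity kernel, the function name `vendi_score`, and the toy feature vectors are assumptions made for illustration; the abstract does not specify which feature extractor or kernel the authors use.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Vendi Score of n feature vectors (one per row).

    Assumption: a cosine-similarity kernel, so K[i, i] == 1 for every sample.
    """
    # L2-normalize each row so that the dot product is cosine similarity.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = x.shape[0]
    k = x @ x.T  # n x n positive semi-definite similarity matrix

    # The eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(k / n)
    eigvals = eigvals[eigvals > 1e-12]  # convention: 0 * log(0) = 0

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


# Toy inputs (hypothetical): three identical samples vs. three orthogonal ones.
identical = np.tile(np.array([[1.0, 0.0, 0.0]]), (3, 1))
orthogonal = np.eye(3)

score_identical = vendi_score(identical)
score_orthogonal = vendi_score(orthogonal)
```

The score can be read as an effective number of distinct samples: it is close to 1 when all samples are identical and close to n when all n samples are mutually dissimilar.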

Paper Structure

This paper contains 28 sections, 9 equations, 15 figures, 8 tables, and 1 algorithm.

Figures (15)

  • Figure 1: (a) We present Contextualized Vendi Score Guidance (c-VSG), an inference-time intervention to increase the diversity of images generated by latent diffusion models (LDMs). c-VSG guides the backward steps of the diffusion process using the Vendi Score (Friedman and Dieng, 2022) to increase the diversity among a sample $x_t$ and a memory bank of previous generations (with weight $\alpha$) while constraining excessive variation using a small set of real, contextualizing exemplar images (with weight $\beta$). (b) Generations for the prompt "dog in Africa", all with the same seed. The first row uses a c-VSG guidance scale of zero, so all samples are identical. As the c-VSG guidance scale increases, the generations become more diverse.
  • Figure 2: Generated images of cooking pots (left) and cars (right). The same six seeds are shared among the examples, and the box colors indicate images pertaining to Africa, Europe, and Southeast Asia. Vendi Score Guidance increases the diversity of generated images, including object type, positioning, and quality. Contextualization with exemplar images increases similarity to real-world diversity. (More examples are shown in the appendix figures.)
  • Figure 3: Examples of images from the real-world reference dataset, GeoDE (Ramaswamy et al., 2022), and images generated with the original LDM using the prompt "{object} in {region}", for example "cooking pot in Africa". Generated images of objects lack diversity compared to the real world and introduce some region-level dependencies in object depiction not seen in GeoDE, such as dilapidated cars for Africa. The colors indicate images pertaining to Africa, Europe, and Southeast Asia.
  • Figure 4: Examples of image seeds before (top) and after (bottom) applying c-VSG to the LDM. The prevalence of unrepresentative background information in a given image is reduced as the dogs become larger and more centrally focused.
  • Figure 5: Examples of increased variation in background with c-VSG, even for low-consistency images.
  • ...and 10 more figures
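
The trade-off described in the Figure 1 caption (reward diversity relative to the memory bank with weight $\alpha$, penalize variation relative to the exemplar set with weight $\beta$) can be illustrated with a small scoring function. This is only a sketch under stated assumptions: the sign convention (add the memory-bank term, subtract the exemplar term), the cosine kernel, the toy feature vectors, and the names `vendi_score` and `contextualized_score` are illustrative and are not taken from the paper. The actual method applies this signal as gradient guidance during the diffusion process, which this example does not attempt.

```python
import numpy as np


def vendi_score(features: np.ndarray) -> float:
    """Vendi Score with a cosine-similarity kernel (an assumed kernel choice)."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    eigvals = np.linalg.eigvalsh((x @ x.T) / x.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # convention: 0 * log(0) = 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))


def contextualized_score(candidate, memory_bank, exemplars, alpha=1.0, beta=1.0):
    """Hypothetical objective: high when the candidate adds diversity to the
    memory bank but adds little variation to the exemplar set."""
    with_memory = np.vstack([memory_bank, candidate[None, :]])
    with_exemplars = np.vstack([exemplars, candidate[None, :]])
    return alpha * vendi_score(with_memory) - beta * vendi_score(with_exemplars)


# Toy features (hypothetical): the memory bank holds two copies of one "look",
# and the exemplars hold two copies of a different "look".
memory_bank = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
exemplars = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])

repeat_of_memory = np.array([1.0, 0.0, 0.0])  # duplicates past generations
close_to_exemplars = np.array([0.0, 1.0, 0.0])  # new, but grounded in exemplars

score_repeat = contextualized_score(repeat_of_memory, memory_bank, exemplars)
score_grounded = contextualized_score(close_to_exemplars, memory_bank, exemplars)
```

Under these assumptions, the candidate that differs from past generations while staying close to the exemplars scores higher than the candidate that repeats the memory bank.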