
GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models

Abhipsa Basu, Mohana Singh, Shashank Agnihotri, Margret Keuper, R. Venkatesh Babu

TL;DR

GeoDiv is a framework that uses large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), which captures economic and condition-related cues, and the Visual Diversity Index (VDI), which measures variation in primary entities and backgrounds.

Abstract

Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across 10 entities and 16 countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: https://abhipsabasu.github.io/geodiv

Paper Structure

This paper contains 60 sections, 1 equation, 24 figures, and 14 tables.

Figures (24)

  • Figure 2: GeoDiv Pipeline. Given an entity $e$ and country $c$, LLMs generate attribute-based questions specific to $e$, and a fixed set of background-related questions applicable across entities. A VQA model predicts answer distributions over an image set for both question types, from which GeoDiv computes the Visual Diversity Index (VDI) via normalized Hill number. The VQA model also rates each image on Affluence and Maintenance to compute the Socio-Economic Visual Index (SEVI).
  • Figure 3: SEVI Diversity and Mean Ratings across Datasets and Countries. India (IN), Nigeria (NG), and Colombia (CO) receive lower SEVI ratings, while the US, UK, and Japan (JP) rank highest, revealing strong socio-economic biases in country-level image representations. Strikingly, none of the models generate images spanning diverse socio-economic strata.
  • Figure 4: VDI Scores across (a) Datasets, (b) Countries. VDI scores are similar across models, with SD2.1 scoring higher than the others. Mexico and the UK show low entity and background diversity, while Japan scores highest.
  • Figure 5: Country-wise maximum and mean JS Divergence across Entities and Models. High maximum values for both Entity and Background indicate high cross-country variations in the respective attribute value distributions.
  • Figure 6: Affluence and Maintenance (SEVI) Scores across Entities. Chair and Stove images show the highest variance in Affluence, whereas Cooking Pot and Stove images appear the least affluent. For Maintenance, Stove, Cooking Pot, and Chair are the most diverse, though the mean ratings are low for all three.
  • ...and 19 more figures
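
The captions above mention two quantities: a normalized Hill number computed from VQA answer distributions (Figure 2) and the Jensen-Shannon (JS) divergence between country-level distributions (Figure 5). The following is a minimal sketch of both, not the authors' implementation. The Hill order q = 1, the (D - 1) / (K - 1) normalization, and all function names are assumptions made for illustration; the paper's exact definitions may differ.

```python
import math
from collections import Counter


def hill_number(probs, q=1.0):
    """Hill number of order q: the effective number of categories."""
    p = [x for x in probs if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1 is the exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def normalized_hill(answers, options, q=1.0):
    """Score in [0, 1]: 0 if all images share one answer, 1 if answers
    are uniform over the options. ASSUMED normalization: (D - 1) / (K - 1)."""
    counts = Counter(a for a in answers if a in options)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no answers match the given options")
    k = len(options)
    if k < 2:
        return 0.0
    probs = [counts[o] / n for o in options]
    return (hill_number(probs, q) - 1.0) / (k - 1.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so it lies in [0, 1]) between
    two distributions over the same ordered answer options."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a_i = 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical VQA answers to "What material is the chair made of?"
options = ["wood", "plastic", "metal"]
answers = ["wood", "wood", "wood", "plastic", "wood", "wood"]
score = normalized_hill(answers, options)
```

Under these assumptions, a score near 0 means the generated images collapse onto a single attribute value, while a high JS divergence between two countries means their attribute distributions differ substantially.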