Culture in Action: Evaluating Text-to-Image Models through Social Activities

Sina Malakouti, Boqing Gong, Adriana Kovashka

TL;DR

The paper addresses cultural faithfulness in text-to-image diffusion models by introducing CULTIVate, a benchmark spanning 16 countries and 9 social activities for evaluating cross-cultural depiction. It proposes AHEaD, a descriptor-based diagnostic suite (ALIGN, HAL, EXAG, DDIV/SDIV, and FAITH) powered by a proposer–refiner pipeline that uses LLMs to generate culturally meaningful descriptors and MLLMs to extract them from images. Empirical results show a systematic bias toward the global north (GN), with higher alignment for GN than for global south (GS) countries across multiple models, and show that AHEaD correlates better with human judgments than existing image-text alignment metrics, especially when its metrics are used in combination. The work provides a scalable, interpretable framework for auditing and improving cultural fidelity in AI-generated imagery, with implications for entertainment, marketing, and research on cultural representation.

Abstract

Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.

Paper Structure

This paper contains 22 sections, 7 equations, 7 figures, 27 tables.

Figures (7)

  • Figure 1: (a) Examples of good (aligned) and bad (hallucinated or exaggerated) aspects of images generated for three cultural activities; these aspects are automatically computed by our framework. (b) Contrasting real and generated images.
  • Figure 2: (Top) Overview of AHEaD. We generate images and extract predicted descriptors $\hat{\mathcal{D}}^{\text{mllm}}$ with an MLLM, while reference LLM descriptors $\mathcal{D}^{\text{llm}}$ are obtained via a proposer–refiner pipeline. Proposers generate diverse candidates, and the Refiner removes duplicates and filters incorrect ones. AHEaD measures cultural competence through alignment, hallucination, exaggeration, and diversity, providing not only quantitative scores but also interpretable feedback (i.e., what is aligned, missing, or exaggerated). (Bottom) Cultural faithfulness metrics. Alignment measures whether expected descriptors are present (similarity above threshold $\tau$), hallucination flags elements unsupported by references (e.g., circular arrangement), and exaggeration detects stereotypical cues overemphasized with respect to real images (e.g., Muslim attire).
  • Figure 3: Analysis of performance by country (left) and activity (right).
  • Figure 4: Country alignment ranked using each of the five descriptor dimensions. (Zoom to 250%.)
  • Figure 5: Comparison of baseline vs. our metric correlations.
  • ...and 2 more figures
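
The Figure 2 caption describes alignment as a thresholded match between reference and predicted descriptors, and hallucination as predicted elements that no reference supports. The sketch below illustrates that idea only. It is not the paper's definition of ALIGN or HAL: the function names `align_score` and `hal_score`, the fraction-based scoring, the default `tau=0.5`, and the toy token-overlap similarity `jaccard` (a stand-in for whatever similarity the authors use) are all assumptions, and the descriptor lists are made-up examples.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy token-overlap similarity (assumption: stands in for the real similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align_score(
    reference: List[str],
    predicted: List[str],
    tau: float = 0.5,
    sim: Callable[[str, str], float] = jaccard,
) -> float:
    """Fraction of reference descriptors matched by some predicted descriptor
    with similarity >= tau (hypothetical formulation)."""
    if not reference:
        return 0.0
    matched = sum(
        1 for r in reference if any(sim(r, p) >= tau for p in predicted)
    )
    return matched / len(reference)


def hal_score(
    reference: List[str],
    predicted: List[str],
    tau: float = 0.5,
    sim: Callable[[str, str], float] = jaccard,
) -> float:
    """Fraction of predicted descriptors that no reference descriptor supports
    at similarity >= tau (hypothetical formulation)."""
    if not predicted:
        return 0.0
    unsupported = sum(
        1 for p in predicted if not any(sim(p, r) >= tau for r in reference)
    )
    return unsupported / len(predicted)


# Made-up descriptors for illustration; "circular arrangement" echoes the
# hallucination example named in the Figure 2 caption.
reference = ["low wooden table", "shared tea pot", "floor seating"]
predicted = ["low wooden table", "floor seating", "circular arrangement"]

alignment = align_score(reference, predicted)
hallucination = hal_score(reference, predicted)
```

With these inputs, two of the three reference descriptors are matched and one of the three predicted descriptors is unsupported. Beyond the scores, the unmatched reference items and unsupported predictions can be listed directly, which is the kind of interpretable feedback the caption mentions (what is aligned, missing, or hallucinated).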