CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal

TL;DR

CulturalFrames provides the first large-scale, quantitative benchmark to evaluate how well text-to-image (T2I) models align with both explicit and implicit cultural expectations across 10 countries and 5 socio-cultural domains. The study combines a culturally grounded prompting pipeline, multi-model image generation, and extensive human annotations. It reveals substantial cultural misalignment: expectations are missed 44% of the time on average, and among these failures explicit expectations are missed at an average rate of 68% and implicit expectations at 49%. It also shows that automatic metrics correlate poorly with human judgments. Current VLM-based metrics (notably VIEScore and UnifiedReward) best approximate human judgments but still fall short, and task-specific prompt expansion and refined instructions modestly improve alignment. The work highlights actionable directions for improving culturally informed generation and evaluation, including richer cultural knowledge integration, explicit handling of implicit cues, and metric redesign. The CulturalFrames dataset thus serves as a testbed to calibrate both generation models and evaluation metrics toward globally usable and culturally aware T2I systems.

Abstract

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts, where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit (stated) and implicit (unstated, implied by the prompt's cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we show that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, provide a concrete testbed, and outline actionable directions for developing culturally informed T2I models and metrics that improve global usability.

Paper Structure

This paper contains 54 sections, 24 figures, 9 tables.

Figures (24)

  • Figure 1: Examples from CulturalFrames benchmark for three selected countries: India, China, and Poland. We ask annotators to evaluate the generated images with respect to both explicit and implicit cultural expectations.
  • Figure 2: Overview of the CulturalFrames dataset pipeline and annotation process. Prompts are first generated using cultural assertions from the Cultural Atlas across categories such as religion and family (top-left). These are transformed into culturally grounded textual prompts using large language models and human filtering (top-middle), and then rendered into images using state-of-the-art text-to-image models (top-right). Human annotators provide fine-grained evaluations (bottom) across four axes: image-prompt alignment, image quality, stereotype presence, and overall score, along with detailed feedback highlighting cultural inaccuracies and visual artifacts.
  • Figure 3: Distribution of image-prompt alignment errors (score $<1$) by model, grouped by error type: implicit, explicit, or both. Bar lengths show the fraction of total errors; percentages show each type's share of the model's total errors.
  • Figure 4: Human evaluation results for selected T2I models. From left to right: 1) Prompt Alignment ($0$-$1$ scale, $1=$perfect alignment). 2) Image Quality ($0$-$1$ scale, $1=$highest quality). 3) Stereotype Score ($0$-$1$ scale, $0=$no stereotyping). 4) Overall Score ($1$-$5$ Likert scale, $5=$best overall). For fairness, we compare across prompts that have images generated by all models.
  • Figure 5: Prompt alignment scores across countries for a given model.
  • ...and 19 more figures
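
The per-model breakdown described in the Figure 3 caption can be illustrated with a small sketch. It is not the authors' code: the record layout, the model names, and the toy values below are assumptions made for illustration only and do not reflect the CulturalFrames annotation schema or its results. The sketch treats a rating as an error when the alignment score is below 1 and at least one expectation type is flagged as missed, then reports each error type's share of a model's errors.

```python
from collections import Counter

# Hypothetical annotation records (NOT the CulturalFrames schema):
# (model, alignment score in [0, 1], explicit missed?, implicit missed?)
ANNOTATIONS = [
    ("model_a", 1.0, False, False),
    ("model_a", 0.5, True, False),
    ("model_a", 0.0, True, True),
    ("model_a", 0.5, False, True),
    ("model_b", 0.5, False, True),
    ("model_b", 1.0, False, False),
]

ERROR_TYPES = ("explicit", "implicit", "both")


def error_type(explicit_missed, implicit_missed):
    """Label an erroneous rating by which kind of expectation was missed."""
    if explicit_missed and implicit_missed:
        return "both"
    return "explicit" if explicit_missed else "implicit"


def error_shares(records):
    """Return, per model, each error type's share of that model's errors."""
    counts = {}
    for model, score, explicit_missed, implicit_missed in records:
        # Assumption: an error is a score below 1 with at least one flagged miss.
        if score < 1 and (explicit_missed or implicit_missed):
            label = error_type(explicit_missed, implicit_missed)
            counts.setdefault(model, Counter())[label] += 1
    return {
        model: {t: c[t] / sum(c.values()) for t in ERROR_TYPES}
        for model, c in counts.items()
    }


if __name__ == "__main__":
    for model, shares in error_shares(ANNOTATIONS).items():
        print(model, {t: f"{s:.0%}" for t, s in shares.items()})
```

Models with no erroneous ratings are simply absent from the result, which avoids a division by zero; a real analysis would also need to decide how to aggregate multiple annotators per image.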