Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models

Mor Ventura, Eyal Ben-David, Anna Korhonen, Roi Reichart

TL;DR

This work investigates how culture is encoded and can be unlocked in multilingual text-to-image diffusion systems. By constructing a cultural ontology (dimensions, domains, concepts) and a cross-language dataset (CulText2I) spanning six TTI models and ten languages, the authors develop prompt templates to elicit cultural signals and a robust evaluation suite combining intrinsic CLIP-based metrics, extrinsic VQA-based tests, and human judgments. Key findings show that cultural knowledge is encoded with varying strength across languages and encoders, with implicit multilingual encoders often outperforming explicit ones, and that language cues and even single characters can reveal cultural features in generated images. The work provides practical prompts and evaluation protocols to study cross-cultural representations in TTI outputs, highlighting implications for cross-cultural AI applications and avenues for further improving multilingual, culturally aware generation.

Abstract

Text-To-Image (TTI) models, such as DALL-E and StableDiffusion, have demonstrated remarkable prompt-based image generation capabilities. Multilingual encoders may have a substantial impact on the cultural agency of these models, as language is a conduit of culture. In this study, we explore the cultural perception embedded in TTI models by characterizing culture across three hierarchical tiers: cultural dimensions, cultural domains, and cultural concepts. Based on this ontology, we derive prompt templates to unlock the cultural knowledge in TTI models, and propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space, extrinsic evaluations with a Visual-Question-Answer (VQA) model and human assessments, to evaluate the cultural content of TTI-generated images. To bolster our research, we introduce the CulText2I dataset, derived from six diverse TTI models and spanning ten languages. Our experiments provide insights regarding Do, What, Which and How research questions about the nature of cultural encoding in TTI models, paving the way for cross-cultural applications of these models.

Paper Structure (45 sections, 21 figures, 9 tables)

Figures (21)

  • Figure 1: StableDiffusion 2.1v images of “A photo of <city>”, where <city> is translated into (left to right) Arabic, Chinese, German (top) and English, Russian, Spanish (bottom).
  • Figure 2: StableDiffusion images generated from all the prompt templates (PTs) for the cultural concept (CC) of Wedding and the Hindi language.
  • Figure 3: TTI model workflow scheme. The visual representations of each Cultural Concept (CC) are image sets generated with different languages (L) and prompt templates (PTs) by different TTI models (M). Then, the images' cultural content is evaluated. Here, for example, CC is God, PT is Translated Concept, M is Llama2 + SD1.4 UNet (LB) and the evaluation uses the cultural dimensions metrics (§ \ref{sec:auto_metrics}).
  • Figure 4: National Association Scores by BLIP2 (XNA) presented as histograms. The x-axis represents bins of mean XNA scores ranging from 0 to 1 across three representative Prompt Templates (PTs): 'Translated Concept', 'EN with Nation', and 'English with Gibberish' (refer to Table \ref{tab:prompt_templates} for details). Higher scores indicate better performance. Colors encode languages.
  • Figure 5: A confusion matrix grid of the NA metric. Prompt Templates: "Translated Concept" (top), "EN with Nation" (bottom). Models: SD (left) and AD (right). Darker colors correspond to higher scores. y-axis: ground-truth languages. x-axis: predicted cultures. For each confusion matrix, we compute the agreement between the predicted and the ground-truth languages (Accuracy, $ACC = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(\text{argmax}(\text{row}_i) = i)$) to measure the cultural encoding strength of a (model, PT) pair. Languages in each grid (top-bottom, left-right): RU, EN, EL, HI, DE, FR, ZH, ES, AR, IW.
  • ...and 16 more figures
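
The accuracy in the Figure 5 caption, $ACC = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}(\text{argmax}(\text{row}_i) = i)$, counts the rows of a confusion matrix whose largest score lies on the diagonal. The following is a minimal sketch of that computation, not the authors' code. The function name `confusion_accuracy`, the toy score matrix, and the tie-breaking rule (first maximum wins) are assumptions made for illustration; the scores are invented and are not results from the paper.

```python
from typing import Sequence


def confusion_accuracy(matrix: Sequence[Sequence[float]]) -> float:
    """Return ACC = (1/n) * sum_i 1[argmax(row_i) == i] for a square matrix.

    Row i holds the scores of ground-truth language i against each
    predicted culture; column order is assumed to match row order.
    """
    n = len(matrix)
    if n == 0:
        raise ValueError("matrix must have at least one row")
    hits = 0
    for i, row in enumerate(matrix):
        if len(row) != n:
            raise ValueError("matrix must be square")
        # Index of the largest score in this row (assumption: first index on ties).
        predicted = max(range(n), key=lambda j: row[j])
        hits += int(predicted == i)
    return hits / n


# Hypothetical 3-language example (illustrative numbers only).
toy_scores = [
    [0.7, 0.2, 0.1],  # row 0: largest score on the diagonal
    [0.3, 0.5, 0.2],  # row 1: largest score on the diagonal
    [0.6, 0.1, 0.3],  # row 2: largest score in column 0, a miss
]
toy_acc = confusion_accuracy(toy_scores)
print(f"ACC = {toy_acc:.3f}")
```

In this toy matrix two of the three rows peak on the diagonal, so the accuracy is 2/3. Under this definition only the position of each row's maximum matters; the margin between the diagonal entry and the other entries does not affect the score.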