Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Chuancheng Shi, Shangze Li, Shiming Guo, Simiao Xie, Wenhua Wu, Jingtong Dou, Chao Wu, Canran Xiao, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua

TL;DR

Multilingual text-to-image models often generate culturally neutral or English-biased images when prompted in languages other than English. The authors show that culture-related knowledge is present but under-activated. They introduce the CultureBench benchmark, a neuron-probing method, and two lightweight alignment strategies (inference-time cultural activation and layer-targeted enhancement) to boost cultural consistency. They demonstrate significant improvements on CultureBench in cultural attribution and visual fidelity, with results supported by automatic metrics and human evaluation. This work provides a practical path toward inclusive, culturally grounded image synthesis and offers a diagnostic benchmark for cross-cultural T2I systems.

Abstract

Multilingual text-to-image (T2I) models have advanced rapidly in visual realism and semantic alignment, and are now widely used. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation, which amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement, which updates only the culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
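To make the idea of inference-time cultural activation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "amplifying" a neuron means multiplying its activation by a constant factor; the abstract does not specify the rule. The function name `amplify_neurons`, the `scale` parameter, and the toy activation values are all hypothetical.

```python
from typing import List, Sequence


def amplify_neurons(
    hidden: Sequence[float],
    neuron_ids: Sequence[int],
    scale: float = 2.0,
) -> List[float]:
    """Return a copy of `hidden` with the selected units multiplied by `scale`.

    Assumption: amplification is a simple multiplicative boost. Model weights
    are never touched, which mirrors the "no backbone fine-tuning" setting.
    """
    selected = set(neuron_ids)
    return [h * scale if i in selected else h for i, h in enumerate(hidden)]


# Toy activation vector for one token at one layer (made-up numbers).
hidden = [0.5, -1.0, 0.25, 2.0]

# Suppose probing flagged units 1 and 3 as culture-sensitive (hypothetical).
boosted = amplify_neurons(hidden, neuron_ids=[1, 3], scale=2.0)
print(boosted)  # [0.5, -2.0, 0.25, 4.0]
```

In a real model this kind of rescaling would typically be applied to the output of the chosen layer during the forward pass, leaving all other layers and all parameters unchanged.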

Paper Structure

This paper contains 31 sections, 13 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Cultural alignment in local languages. (a) LLMs/recommenders keep cultural consistency, but T2I models falter with "noun-only" prompts. (b) Adding a "culture-style modifier + noun" restores consistency.
  • Figure 2: Overview of the CultureBench pipeline. Data from 15 linguistic regions are manually collected and rigorously quality-controlled; "culture-style modifier + noun" captions are annotated using GPT5-Nano, and "noun-only" captions through human annotation; the annotated content is then converted into local languages via translation tools, supplemented by manual review.
  • Figure 3: Data distribution of the proposed CultureBench dataset across 15 languages. The dataset is divided into train, test, and neuron-detection subsets with a ratio of 7:2:1.
  • Figure 4: Hypothesis verification. Within the CultureBench test subset, performance under the "culture-style modifier + noun" and "noun-only" prompt conditions is compared. Quantitative evaluation is conducted using CultureVQA.
  • Figure 5: Methods for neuron detection. (a) By comparing attention allocation between culture-style modifiers and nouns across text-encoder layers, the layer with the largest divergence is designated as the culturally sensitive layer. (b) At this layer, features from the "culture-style modifier + noun" and "noun-only" prompts are fed into a sparse autoencoder (SAE; Cunningham et al., 2023) to obtain sparse features, revealing neurons with heightened sensitivity to cultural cues.
  • ...and 14 more figures
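
The layer-selection step described in the Figure 5(a) caption can be illustrated with a small sketch. It is written under stated assumptions and is not the paper's procedure: "divergence" is taken here to be the absolute difference between the attention mass on the culture-style modifier and on the noun at each layer, and the per-layer numbers are invented for illustration. The name `most_divergent_layer` is hypothetical.

```python
from typing import Sequence


def most_divergent_layer(
    modifier_attn: Sequence[float],
    noun_attn: Sequence[float],
) -> int:
    """Return the index of the layer where the two attention profiles differ most.

    Assumption: divergence is the absolute gap between the attention given to
    the modifier and to the noun; the caption does not name the exact measure.
    """
    if len(modifier_attn) != len(noun_attn) or not modifier_attn:
        raise ValueError("need one attention value per layer for both inputs")
    gaps = [abs(m - n) for m, n in zip(modifier_attn, noun_attn)]
    # Pick the layer index with the largest gap (first one on ties).
    return max(range(len(gaps)), key=gaps.__getitem__)


# Made-up attention shares for a four-layer text encoder.
modifier_attn = [0.30, 0.45, 0.70, 0.40]
noun_attn = [0.35, 0.40, 0.20, 0.38]

layer = most_divergent_layer(modifier_attn, noun_attn)
print(layer)  # 2
```

The selected index would then identify the layer whose features are passed to the sparse autoencoder in step (b).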