Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Chuancheng Shi; Shangze Li; Shiming Guo; Simiao Xie; Wenhua Wu; Jingtong Dou; Chao Wu; Canran Xiao; Cong Wang; Zifeng Cheng; Fei Shen; Tat-Seng Chua

TL;DR

Multilingual text-to-image models often generate culturally neutral or English-biased images when prompted in languages other than English. The authors show that culture-related knowledge is present but under-activated. They introduce CultureBench, a probing framework, and two lightweight alignment strategies (inference-time cultural activation and layer-targeted enhancement) to boost cultural consistency. They report significant improvements on CultureBench in cultural attribution and visual fidelity, supported by automatic metrics and human evaluation. The work provides a practical path toward inclusive, culturally grounded image synthesis and offers a diagnostic benchmark for cross-cultural T2I systems.

Abstract

Multilingual text-to-image (T2I) models have advanced rapidly in visual realism and semantic alignment, and are now widely used. Yet outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. We conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation, which amplifies the identified neurons without fine-tuning the backbone; and (2) layer-targeted cultural enhancement, which updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.
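To make the idea of locating culture-sensitive neurons and amplifying them at inference time concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the ranking rule (absolute gap in mean activation between culture-specific and neutral prompts), the function names `top_k_sensitive_neurons` and `amplify`, and the scalar `scale` are all assumptions made for illustration, and the activations are plain lists rather than real model tensors.

```python
from typing import List, Sequence


def mean_activation(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average each neuron's activation over a batch of prompts."""
    n = len(rows)
    width = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(width)]


def top_k_sensitive_neurons(
    cultural: Sequence[Sequence[float]],
    neutral: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Hypothetical probe: rank neurons by the absolute gap between their
    mean activation on culture-specific prompts and on neutral prompts,
    then return the indices of the k largest gaps in ascending order."""
    mean_cultural = mean_activation(cultural)
    mean_neutral = mean_activation(neutral)
    gaps = [abs(a - b) for a, b in zip(mean_cultural, mean_neutral)]
    ranked = sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)
    return sorted(ranked[:k])


def amplify(
    activations: Sequence[float],
    neuron_ids: Sequence[int],
    scale: float,
) -> List[float]:
    """Assumed form of inference-time activation: multiply the selected
    neurons by `scale` and leave every other neuron and all weights untouched."""
    chosen = set(neuron_ids)
    return [v * scale if j in chosen else v for j, v in enumerate(activations)]


# Toy activations for one layer with three neurons (rows are prompts).
cultural_acts = [[1.0, 0.0, 3.0], [1.0, 0.0, 5.0]]
neutral_acts = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]

selected = top_k_sensitive_neurons(cultural_acts, neutral_acts, k=1)
boosted = amplify([1.0, 2.0, 3.0], selected, scale=2.0)
```

In a real model the same pattern would be applied to the hidden states of the chosen layers during the forward pass, so no weights change. How the neurons are actually scored and how strongly they are amplified is specified in the paper, not here.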
