A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization

Janak Kapuriya, Ali Hatami, Paul Buitelaar

TL;DR

The paper tackles cultural fidelity gaps in multilingual story visualization by introducing a Progressive Multicultural Evaluation Framework that combines five culturally aware metrics with an automated MLLM-as-Jury mechanism to approximate human judgments across English, Chinese, and Hindi on real-world and animated datasets. It outlines a translation and multilingual diffusion-based generation pipeline (MuLan) and two evaluation schemes (Sequential and Individual Scene) for critiquing narrative visuals. Key contributions include the five metrics (Cultural Appropriateness, Visual Aesthetics, Cohesion, Object Presence, Semantic Consistency), the three-level progressive evaluation (V1–V3) with illustrative examples, and the MLLM-as-Jury framework, which aggregates three diverse judge models. Findings show a strong Western bias in generated visuals, better cultural alignment on real-world data, and variable correlations between MLLM-as-Jury and human judgments, highlighting the need for more culturally grounded multilingual storytelling methods. The work provides a structured pathway for evaluating and improving cultural fidelity in visual narratives across languages and cultures.

Abstract

Recent advancements in text-to-image generative models have improved narrative consistency in story visualization. However, current story visualization models often overlook cultural dimensions, resulting in visuals that lack authenticity and cultural fidelity. In this study, we conduct a comprehensive multicultural analysis of story visualization using current text-to-image models across multilingual settings on two datasets: FlintstonesSV and VIST. To assess cultural dimensions rigorously, we propose a Progressive Multicultural Evaluation Framework and introduce five story visualization metrics, Cultural Appropriateness, Visual Aesthetics, Cohesion, Semantic Consistency, and Object Presence, that are not addressed by existing metrics. We further automate assessment through an MLLM-as-Jury framework that approximates human judgment. Human evaluations show that models generate more coherent, visually appealing, and culturally appropriate stories for real-world datasets than for animated ones. The generated stories exhibit a stronger alignment with English-speaking cultures across all metrics except Cohesion, where Chinese performs better. In contrast, Hindi ranks lowest on all metrics except Visual Aesthetics, reflecting real-world cultural biases embedded in current models. This multicultural analysis provides a foundation for future research aimed at generating culturally appropriate and inclusive visual stories across diverse linguistic and cultural settings.
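The jury idea in the abstract can be illustrated with a minimal sketch. It assumes that the scores from the three judge models are combined by a simple mean and that agreement with human raters is measured with Spearman rank correlation; neither choice is confirmed by the text here, and all function names and scores below are hypothetical illustration values, not results from the paper.

```python
from statistics import mean


def jury_score(judge_scores):
    """Combine per-judge scores into one jury score (assumed rule: mean)."""
    return mean(judge_scores)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one metric, four generated stories, three judges each.
judge_scores_per_story = [[4, 5, 4], [2, 3, 2], [3, 3, 4], [1, 2, 1]]
human_scores = [5, 2, 4, 1]

jury_scores = [jury_score(s) for s in judge_scores_per_story]
agreement = spearman(jury_scores, human_scores)
print(jury_scores, agreement)
```

A rank-based correlation is used in this sketch because human and model ratings are ordinal and may sit on different effective scales; other aggregation rules (for example a median or a majority vote) would slot into `jury_score` without changing the rest.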

Paper Structure

This paper contains 27 sections, 2 equations, 30 figures, 7 tables.

Figures (30)

  • Figure 1: Cultural inconsistencies and stereotypes in generated story scenes across models and languages. Ex–1: The model fails to interpret the Chinese word for 'president' as 'party leader' in China and instead generates an image of a US president. Ex–2: Instead of a modern craft fair, the model depicts temples and a crowded assembly of saints, reinforcing the stereotype of an ancient Mahakumbh-style fair. Ex–3: Stereotypical red lanterns dominate the rides, misrepresenting Chinese culture. Ex–4: A Hindi cultural stereotype depicts a parent leading children in a race rather than using a stroller, with several children barefoot, reinforcing inaccurate cultural assumptions.
  • Figure 2: Three-stage framework for multicultural analysis of story visualization using text-to-image models.
  • Figure 3: Human evaluation results across datasets, languages and evaluation metrics.
  • Figure 4: Progressive Culture evaluation scores by MLLM-as-Jury across three levels on FlintstonesSV and VIST datasets.
  • Figure 5: Correlation between Human and MLLM-as-Jury evaluation on FlintstonesSV and VIST datasets.
  • ...and 25 more figures