A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization
Janak Kapuriya, Ali Hatami, Paul Buitelaar
TL;DR
The paper tackles cultural fidelity gaps in multilingual story visualization by introducing a Progressive Multicultural Evaluation Framework that combines five culturally aware metrics with an automated MLLM-as-Jury mechanism to approximate human judgments across English, Chinese, and Hindi on real-world and animated datasets. It outlines a translation and multilingual diffusion-based generation pipeline (MuLan) and two evaluation schemes (Sequential and Individual Scene) for assessing narrative visuals. Key contributions include the five metrics (Cultural Appropriateness, Visual Aesthetics, Cohesion, Object Presence, Semantic Consistency), the three-level progressive evaluation (V1–V3) with illustrative examples, and the MLLM-as-Jury framework that aggregates three diverse judge models. Findings show a strong Western bias in generated visuals, better cultural alignment on real-world data, and variable correlations between MLLM-as-Jury and human judgments, highlighting the need for more culturally grounded multilingual storytelling methods. The work provides a structured pathway for evaluating and improving cultural fidelity in visual narratives across languages and cultures.
Abstract
Recent advancements in text-to-image generative models have improved narrative consistency in story visualization. However, current story visualization models often overlook cultural dimensions, resulting in visuals that lack authenticity and cultural fidelity. In this study, we conduct a comprehensive multicultural analysis of story visualization using current text-to-image models across multilingual settings on two datasets: FlintstonesSV and VIST. To assess cultural dimensions rigorously, we propose a Progressive Multicultural Evaluation Framework and introduce five story visualization metrics (Cultural Appropriateness, Visual Aesthetics, Cohesion, Semantic Consistency, and Object Presence) that capture dimensions not addressed by existing metrics. We further automate assessment through an MLLM-as-Jury framework that approximates human judgment. Human evaluations show that models generate more coherent, visually appealing, and culturally appropriate stories for real-world datasets than for animated ones. The generated stories exhibit a stronger alignment with English-speaking cultures across all metrics except Cohesion, where Chinese performs better. In contrast, Hindi ranks lowest on all metrics except Visual Aesthetics, reflecting real-world cultural biases embedded in current models. This multicultural analysis provides a foundation for future research aimed at generating culturally appropriate and inclusive visual stories across diverse linguistic and cultural settings.
