Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

Yuchen Song; Andong Chen; Wenxin Zhu; Kehai Chen; Xuefeng Bai; Muyun Yang; Tiejun Zhao

TL;DR

This paper introduces C$^3$B, a comics-based benchmark designed to evaluate cultural awareness in multimodal LLMs (MLLMs) across multicultural, multitask, and multilingual settings. It comprises 2,220 images and 18,789 QA pairs organized into three progressively difficult tasks (Extraction@Culture, Conflict@Culture, and Generation@Culture) that probe visual recognition, cultural conflict understanding, and multilingual content generation. Evaluations on 11 open-source MLLMs reveal a substantial gap relative to human performance, with models showing particular weakness on lesser-known cultures and cultural conflicts. The work provides a strong baseline and diagnostic insights to guide future efforts in enhancing cross-cultural understanding in multimodal systems.

Abstract

Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images, each of which typically depicts a single culture, making these benchmarks relatively easy for MLLMs. To address these limitations, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask, and multilingual benchmark for cultural awareness. C$^3$B comprises over 2,000 images and over 18,000 QA pairs, organized into three tasks of progressive difficulty: from basic visual recognition, to higher-level cultural conflict understanding, and finally to cultural content generation. We evaluate 11 open-source MLLMs, revealing a significant gap between MLLM and human performance. This gap demonstrates that C$^3$B poses substantial challenges for current MLLMs and should encourage future research on advancing their cultural awareness.
