
Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness

Yuchen Song, Andong Chen, Wenxin Zhu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao

TL;DR

This paper introduces C$^3$B, a comics-based benchmark designed to evaluate cultural awareness in multimodal LLMs across multicultural, multitask, and multilingual settings. It comprises 2,220 images and 18,789 QA pairs organized into three progressively difficult tasks (Extraction@Culture, Conflict@Culture, and Generation@Culture) that probe visual recognition, cultural conflict understanding, and multilingual content generation. Evaluations on 11 open-source MLLMs reveal a substantial gap relative to human performance, with models showing particular weakness on lesser-known cultures and cultural conflicts. The work provides a strong baseline and diagnostic insights to guide future efforts in enhancing cross-cultural understanding in multimodal systems.

Abstract

Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, current benchmarks often use real-world images, each of which typically depicts a single culture, making these benchmarks relatively easy for MLLMs. To address these limitations, we propose C$^3$B ($\textbf{C}$omics $\textbf{C}$ross-$\textbf{C}$ultural $\textbf{B}$enchmark), a novel multicultural, multitask, and multilingual benchmark for cultural awareness. C$^3$B comprises over 2,000 images and over 18,000 QA pairs, built around three tasks of progressive difficulty, from basic visual recognition to higher-level cultural conflict understanding, and finally to cultural content generation. We evaluated 11 open-source MLLMs, revealing a significant gap between MLLM and human performance. The gap demonstrates that C$^3$B poses substantial challenges for current MLLMs, encouraging future research to advance the cultural awareness capabilities of MLLMs.

Paper Structure

This paper contains 37 sections, 2 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Comparison between C$^3$B and previous cultural awareness benchmarks. Unlike existing benchmarks, C$^3$B supports multicultural, multilingual, and multitask contexts, thereby facilitating a more thorough evaluation.
  • Figure 2: Overview of C$^3$B. C$^3$B evaluates MLLMs across three dimensions: Object Identification (foundational vision capability based on culture), Conflict Identification (cultural conflict understanding), and Culturally-aligned Content Generation (comprehensive cultural generation).
  • Figure 3: The construction process of C$^3$B. The process consists of three steps: Comics Generation, Annotation for Extraction@Culture and Conflict@Culture, and Annotation for Generation@Culture.
  • Figure 4: The cultures covered by C$^3$B, shown on a world map. Regions shaded in blue indicate that the culture is included in C$^3$B.
  • Figure 5: Scores of QA pairs in different cultures.
  • ...and 12 more figures