When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts

Jun Seong Kim, Kyaw Ye Thu, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, Alice Oh

TL;DR

This work introduces MixCuBe, a cross-cultural benchmark for evaluating cultural bias in multimodal LLMs under mixed-cultural contexts. By creating 2.5k images across five cultures and three cultural markers, and generating four ethnicity-swapped synthesized images per original, the authors assess Country Identification and Cultural Marker Identification across three MLLMs. Results show that higher-resource cultures are more robust to ethnicity perturbations, while low-resource cultures exhibit significant accuracy drops, with GPT-4o showing notable sensitivity for Azerbaijan. The dataset and methodology provide a platform for analyzing and mitigating cultural bias in multimodal systems. Limitations such as synthesis artifacts and data contamination are acknowledged, with plans for broader, more diverse future work.

Abstract

In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows a difference in accuracy of up to 58% between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: https://huggingface.co/datasets/kyawyethu/MixCuBe.
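
The accuracy difference between original and perturbed settings mentioned above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields (`variant`, `label`, `pred`), the variant name `synth_a`, and the function names are hypothetical, chosen only to show how a per-ethnicity accuracy drop might be computed from model predictions.

```python
from collections import defaultdict

def accuracy(records):
    """Fraction of records whose prediction matches the label."""
    if not records:
        raise ValueError("no records to score")
    return sum(r["pred"] == r["label"] for r in records) / len(records)

def accuracy_drop(records):
    """Map each synthesized variant to (original accuracy - its accuracy)."""
    by_variant = defaultdict(list)
    for r in records:
        by_variant[r["variant"]].append(r)
    base = accuracy(by_variant["original"])
    return {
        variant: base - accuracy(rows)
        for variant, rows in by_variant.items()
        if variant != "original"
    }

# Toy, made-up predictions (not results from the paper).
records = [
    {"variant": "original", "label": "Korea", "pred": "Korea"},
    {"variant": "original", "label": "Korea", "pred": "Korea"},
    {"variant": "synth_a", "label": "Korea", "pred": "Korea"},
    {"variant": "synth_a", "label": "Korea", "pred": "Japan"},
]
drops = accuracy_drop(records)
print(drops)
```

A positive value means the model was less accurate on the ethnicity-swapped images than on the originals; a value near zero suggests robustness to the perturbation.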

Paper Structure

This paper contains 25 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: An example of the experiment, in which an MLLM is tested on both the original image and a synthesized image where the ethnicity of a person is altered.
  • Figure 2: Image synthesis process, with sample pairs of original and synthesized images alongside their corresponding masks.
  • Figure 3: Country Identification accuracy on original images and the average over the corresponding synthesized images of four ethnicities (shown in paler colors) for each country-category pair.
  • Figure 4: Heatmap of Country Identification accuracy difference. The value in each cell is the difference in Country Identification accuracy between the original images and the images with the synthesized ethnicity. The red boxes highlight the pairs where the ethnicity synthesized by the inpainting model closely resembles a culture's demographic.
  • Figure 5: Cultural Marker Identification accuracy evaluated on Food images.
  • ...and 3 more figures