From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection

Mo Wang, Kaixuan Ren, Pratik Jalan, Ahmed Ashraf, Tuong Vy Vu, Rahul Seetharaman, Shah Nawaz, Usman Naseem

TL;DR

This work interrogates how cultural context shapes hateful meme detection by evaluating state-of-the-art vision-language models across six native meme datasets. It introduces a multidimensional evaluation framework that jointly varies learning (zero-shot vs one-shot), prompt language (native vs English), and translation effects, using a translated-caption protocol to isolate language influence. The study finds that translate-then-detect pipelines underperform, while native-language prompting and one-shot learning substantially improve cross-cultural robustness, revealing a Western-centric bias in many VLMs. It further provides qualitative diagnostics of failure modes, such as semantic distortion and representation gaps, and proposes actionable strategies, including a hybrid deployment approach that combines locally fine-tuned models with general-purpose VLMs to achieve globally robust moderation. The results underscore the importance of culturally aware evaluation and intervention design for fair and effective global content moderation.

Abstract

Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common "translate-then-detect" approach degrades performance, while culturally aligned interventions (native-language prompting and one-shot learning) significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.
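
As a rough illustration of the three evaluation axes named in the abstract, the sketch below enumerates them as a grid of conditions. It is only a sketch under stated assumptions: the names `Condition`, `LEARNING`, `PROMPT_LANGUAGE`, `CAPTION`, and `build_grid` are hypothetical, and the full cross product is assumed for illustration. The paper may evaluate only a subset of these combinations.

```python
from dataclasses import dataclass
from itertools import product

# Axis values taken from the abstract; the variable names are hypothetical.
LEARNING = ("zero-shot", "one-shot")
PROMPT_LANGUAGE = ("native", "english")
# Assumption: the translation axis is modelled as original vs. machine-translated caption.
CAPTION = ("original", "machine-translated")


@dataclass(frozen=True)
class Condition:
    """One evaluation setting: a point in the three-axis grid."""
    learning: str
    prompt_language: str
    caption: str


def build_grid():
    """Return every combination of the three axes (assumed full cross product)."""
    return [
        Condition(learning, prompt_language, caption)
        for learning, prompt_language, caption in product(
            LEARNING, PROMPT_LANGUAGE, CAPTION
        )
    ]


grid = build_grid()
for condition in grid:
    print(condition)
```

Each `Condition` would then be paired with a dataset and a VLM to produce one evaluation run. How prompts are built and how outputs are scored is left out here, since the text above does not specify it.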

Paper Structure (45 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overall workflow of our multilingual hateful meme detection evaluation. Starting from native memes (e.g., Arabic), we apply optional machine translation (MT) into multiple target languages, followed by the construction of different prompt types (English/Native × Zero-Shot/One-Shot). These prompts are then fed into VLMs, whose predictions and explanations are used to assess cross-lingual cultural robustness under different interaction strategies.
  • Figure 2: Heatmaps for zero-shot performance across seven VLMs.
  • Figure 3: Heatmaps for one-shot performance across seven VLMs.
  • Figure 4: Task-specific transfer heatmaps.
  • Figure 5: Arabic meme illustrating a Semantic Flip failure case.