Exploring the Limits of Zero Shot Vision Language Models for Hate Meme Detection: The Vulnerabilities and their Interpretations

Naquee Rizwan; Paramananda Bhaskar; Mithun Das; Swadhin Satyaprakash Majhi; Punyajoy Saha; Animesh Mukherjee

TL;DR

The paper interrogates zero-shot vision-language models for hate meme detection, applying thorough prompt engineering across six multilingual datasets and six models. It introduces a novel superpixel occlusion method for interpretable misclassification analysis and develops an error-typology framework to guide safety guardrails without finetuning. GPT-4o emerges as the strongest model overall, with open-source variants lagging on multilingual data, underscoring the need for robust prompts and governance in automated moderation. Collectively, the work offers a principled approach to model selection, prompt design, and typology-driven safeguards to improve hate meme detection systems in practice.

Abstract

Multimedia content is rapidly increasing on current social media platforms, and memes are one of its most popular forms. While memes were primarily invented to promote funny and buoyant discussions, malevolent users exploit them to target individuals or vulnerable communities, making it imperative to identify and address such hateful memes. Social media platforms are therefore in dire need of active moderation of such harmful content. Manual moderation is extremely difficult at this scale, while automatic moderation is challenged by the need for good-quality annotated data to train hate meme detection algorithms. This makes a perfect pretext for exploring the power of modern vision language models (VLMs), which have exhibited outstanding performance across various tasks. In this paper, we study the effectiveness of VLMs in handling intricate tasks such as hate meme detection in a completely zero-shot setting, so that there is no dependency on annotated data for the task. We perform thorough prompt engineering and query state-of-the-art VLMs using various prompt types to detect hateful/harmful memes. We further interpret the misclassification cases using a novel superpixel-based occlusion method. Finally, we show that these misclassifications can be neatly arranged into a typology of error classes, knowledge of which should enable the design of better safety guardrails in the future.
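The general idea behind occlusion-based interpretation can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not describe how superpixels are computed or how the model is queried, so the sketch assumes square grid cells as a stand-in for superpixels and a hypothetical `score_fn` as a stand-in for a VLM's "hateful" score. Each region is masked in turn, and the drop in score is taken as that region's importance.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel values


def occlusion_importance(
    image: Image,
    score_fn: Callable[[Image], float],
    grid: int = 2,
    fill: float = 0.0,
) -> List[List[float]]:
    """Return a grid x grid map of score drops caused by masking each cell.

    Assumptions (not from the paper): square grid cells replace
    superpixels, and score_fn is a hypothetical scoring callable.
    """
    h, w = len(image), len(image[0])
    base = score_fn(image)
    # Cell boundaries; integer division handles sizes not divisible by grid.
    ys = [k * h // grid for k in range(grid + 1)]
    xs = [k * w // grid for k in range(grid + 1)]
    importance = [[0.0] * grid for _ in range(grid)]
    for i in range(grid):
        for j in range(grid):
            # Copy the image and overwrite one cell with the fill value.
            occluded = [row[:] for row in image]
            for y in range(ys[i], ys[i + 1]):
                for x in range(xs[j], xs[j + 1]):
                    occluded[y][x] = fill
            # A large positive drop means the cell mattered to the score.
            importance[i][j] = base - score_fn(occluded)
    return importance


def toy_score(img: Image) -> float:
    # Hypothetical stand-in for a model: mean brightness of the top-left 4x4 block.
    return sum(img[y][x] for y in range(4) for x in range(4)) / 16.0


image = [[1.0] * 8 for _ in range(8)]
heatmap = occlusion_importance(image, toy_score, grid=2)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy example only the top-left cell affects the score, so it is the only cell with non-zero importance. With a real model, the same loop would highlight the image regions whose removal changes the prediction the most.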

Paper Structure (29 sections, 7 figures, 8 tables)

This paper contains 29 sections, 7 figures, 8 tables.

Introduction
Related works
Datasets and metrics
Models
Prompts
Experimental setup
Results
Error analysis
Occlusion based result interpretation
Actionable evaluation: Typology of the error cases
Action items
Conclusion
Limitations
Ethics statement
Definitions
...and 14 more sections

Figures (7)

Figure 1: Pipeline: A concise summary and flow of the evaluation and analysis carried out in this work.
Figure 2: Typology: Green circles represent misclassification to the positive label; red circles represent misclassification to the negative label. Each set of misclassifications is bifurcated into two clusters. The distribution of cases, topic words and representative image clusters are shown for the GPT-4o and LLaVA-1.5 13B models on the FHM, MAMI and HARM-C + P datasets. Enlarged image clusters and results for BHM and HinGlish are in the Appendix. Important keywords in each topic are marked in bold.
Figure 3: Typology Clusters: Upper panel images: GPT-4o error clusters. Lower panel images: LLaVA-1.5 13B error clusters. Each cluster and dataset are separated by dashed lines.
Figure 4: Typology Extended: Green circles represent misclassification to the positive label; red circles represent misclassification to the negative label. Each set of misclassifications is bifurcated into two clusters. The distribution of cases, topic words and representative image clusters are shown for the GPT-4o and LLaVA-1.5 13B models on the BHM and HinGlish datasets. Important keywords in each topic are marked in bold.
Figure 5: Examples of Wrong Annotation: Fifteen examples, nine from MAMI and six from FHM, are shown for CASE 2 of GPT-4o with def + OCR as input and explanation as output. The output of the model is also provided.
...and 2 more figures
