Can Thinking Models Think to Detect Hateful Memes?

Mohamed Bayan Kmainasi; Mucahid Kutlu; Ali Ezzat Shahroor; Abul Hasnat; Firoj Alam

TL;DR

The paper proposes a reinforcement-learning-based post-training framework that improves reasoning in thinking-based MLLMs for hateful meme analysis. It combines task-specific rewards with a novel Group Relative Policy Optimization (GRPO) objective that jointly optimizes meme classification and explanation quality, encouraging fine-grained, step-by-step reasoning.

Abstract

Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
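
To make the idea of jointly rewarding classification and explanation quality more concrete, the following is a minimal sketch, not the authors' objective. It assumes a hypothetical reward that adds a correctness term to an explanation-quality score in [0, 1], and it applies the standard group-relative normalization commonly used in GRPO (reward minus group mean, divided by group standard deviation). The function names, the weights `w_cls` and `w_exp`, and the example scores are all illustrative assumptions.

```python
from statistics import mean, pstdev


def combined_reward(pred_label, gold_label, explanation_score, w_cls=1.0, w_exp=1.0):
    """Hypothetical joint reward: label correctness plus an explanation score in [0, 1].

    The weights are illustrative assumptions, not values from the paper.
    """
    cls = 1.0 if pred_label == gold_label else 0.0
    return w_cls * cls + w_exp * explanation_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Uses the population standard deviation; eps avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one meme whose gold label is "hateful".
# Each tuple is (predicted label, made-up explanation-quality score).
samples = [
    ("hateful", 0.8),
    ("hateful", 0.4),
    ("not_hateful", 0.9),
    ("not_hateful", 0.2),
]
rewards = [combined_reward(label, "hateful", score) for label, score in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch, a completion with the correct label and a strong explanation receives the largest advantage, while a fluent explanation attached to a wrong label is still penalized relative to the group.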

Paper Structure (37 sections, 8 equations, 2 figures, 5 tables)

Introduction
Related Work
Multimodal hateful meme detection
RL for MLLMs
Datasets
Dataset Preparation
Hateful Memes -- Original Dataset
Explanation/Reasoning
Fine-Grained Labels
Step-by-Step Reasoning -- Our Extension
CoT Evaluation with LLM-as-a-Judge
Methodology
Framework Overview
Task Formulation
Training thinking-based MLLMs
...and 22 more sections

Figures (2)

Figure 1: Overview of the proposed hateful meme analysis framework. Starting from raw memes, we derive binary and fine-grained supervision (protected category and attack type), OCR text, and guidelines to construct instruction-following datasets. Weak supervision from a strong MLLM is used to distill step-by-step CoT rationales. The model is trained via a two-stage post-training pipeline consisting of SFT warm-up followed by GRPO-based reinforcement learning, jointly optimizing classification accuracy and explanation quality. During inference, multiple candidate label–explanation pairs are generated and scored to select the final prediction. Abbreviations: Acc. = accuracy; MET = METEOR; Len = explanation length; Fmt = format compliance.
Figure 2: GRPO training dynamics under different initializations, showing mean group reward (left axis) and mean completion length (right axis). Moving-average smoothing with a window size of 1000 is applied.
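
The smoothing mentioned in the Figure 2 caption can be illustrated with a short sketch. The caption does not say whether the window is trailing or centered, so this example assumes a trailing window and averages over however many points are available at the start of the series. The function name `moving_average` is illustrative.

```python
from collections import deque


def moving_average(values, window=1000):
    """Trailing moving average over a sequence of logged training metrics.

    Early points are averaged over the samples seen so far, so the
    output has the same length as the input.
    """
    buf = deque()
    total = 0.0
    smoothed = []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Drop the oldest value once the window is full.
            total -= buf.popleft()
        smoothed.append(total / len(buf))
    return smoothed


# Tiny example with a window of 2 instead of 1000.
demo = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
```

With a window of 1000, step-to-step noise in the reward and completion-length curves is suppressed and longer-run trends are easier to compare across initializations.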