Can Thinking Models Think to Detect Hateful Memes?

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

TL;DR

A reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning.

Abstract

Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
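The abstract names a GRPO-based objective but does not spell it out here. As a rough illustration only, the sketch below shows the group-relative advantage that standard GRPO computes: several completions are sampled for the same input, and each reward is standardized against the group's mean and standard deviation. This is the generic formulation, not the paper's novel variant. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one meme.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every completion in a group earns the same reward, all advantages are zero and the group contributes no learning signal.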

Paper Structure (37 sections, 8 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Overview of the proposed hateful meme analysis framework. Starting from raw memes, we derive binary and fine-grained supervision (protected category and attack type), OCR text, and guidelines to construct instruction-following datasets. Weak supervision from a strong MLLM is used to distill step-by-step CoT rationales. The model is trained via a two-stage post-training pipeline consisting of SFT warm-up followed by GRPO-based reinforcement learning, jointly optimizing classification accuracy and explanation quality. During inference, multiple candidate label–explanation pairs are generated and scored to select the final prediction. Abbreviations: Acc. = accuracy; MET = METEOR; Len = explanation length; Fmt = format compliance.
  • Figure 2: GRPO training dynamics under different initializations, showing mean group reward (left axis) and mean completion length (right axis). Moving-average smoothing with a window size of 1000 is applied.
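The Figure 1 caption lists four reward signals: accuracy, METEOR, explanation length, and format compliance. One way such signals might be combined into a single scalar training reward is sketched below. Everything concrete in it is hypothetical: the weights, the `<label>`/`<explanation>` tag format, the target length, and the shape of the length score are not taken from the paper. To keep the example self-contained, a simple token-overlap F1 stands in for METEOR; it is a different and cruder metric.

```python
import re
from collections import Counter

# Hypothetical weights and output format; the paper's actual values are not given here.
WEIGHTS = {"acc": 1.0, "expl": 0.5, "len": 0.25, "fmt": 0.25}
PATTERN = re.compile(
    r"<label>(hateful|not hateful)</label>\s*<explanation>(.+?)</explanation>",
    re.S | re.I,
)


def token_f1(candidate, reference):
    """Token-overlap F1, used here as a stand-in for METEOR (not the same metric)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def length_score(explanation, target=40):
    """Hypothetical length term: 1.0 at the target word count, decaying linearly to 0."""
    n = len(explanation.split())
    return max(0.0, 1.0 - abs(n - target) / target)


def composite_reward(completion, gold_label, reference):
    """Weighted sum of the four signals; malformed outputs receive no credit."""
    match = PATTERN.search(completion)
    if match is None:
        return 0.0
    label = match.group(1).lower()
    explanation = match.group(2).strip()
    parts = {
        "acc": float(label == gold_label),
        "expl": token_f1(explanation, reference),
        "len": length_score(explanation),
        "fmt": 1.0,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


reference = "the caption mocks a religious group"
good = "<label>hateful</label><explanation>the caption mocks a religious group</explanation>"
print(composite_reward(good, "hateful", reference))
```

Returning zero for malformed outputs is one possible design choice: it makes format compliance a precondition for any other credit, so the policy cannot collect explanation reward while ignoring the required structure.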