ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection
Jingbiao Mei, Mingsheng Sun, Jinghong Chen, Pengda Qin, Yuhong Li, Da Chen, Bill Byrne
TL;DR
ExPO-HM addresses the need for explainable hateful meme detection by reframing the task as Explain-then-Detect and introducing a training paradigm that mirrors human moderator workflows. It combines SFT warmup on policy-manual augmented data, curriculum-guided GRPO, and a Conditional Decision Entropy reward to improve both the quality of explanations and the final decisions. Across three benchmarks, ExPO-HM achieves state-of-the-art performance in binary detection, fine-grained classification, and reasoning quality, with substantial F1 gains over prior post-training baselines and better-calibrated decisions. Because it produces explanations alongside its predictions, the approach offers interpretable, actionable moderation support and could strengthen real-world content moderation. The results point to a practical framework for explanation-driven moderation in multimodal settings and offer insights into improving reasoning with calibrated, policy-aligned guidance.
Abstract
Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection and produce only binary predictions, so they fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues with such systems: the model does not hypothesize important policy-relevant cues, such as targets and attack types, as likely explanations; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both a metric and a reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15% and 17% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support.
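
The abstract names Conditional Decision Entropy (CDE) but does not define it. As a rough illustration only, one plausible reading is the Shannon entropy of the model's final decision, conditioned on a generated explanation and estimated from repeated decision samples. The sketch below follows that assumed reading; the function names, the sampling setup, and the averaging over explanations are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def decision_entropy(decisions):
    """Shannon entropy (in bits) of the empirical distribution of sampled decisions.

    `decisions` is a list of labels (e.g. "hateful" / "benign") sampled from a
    model after it has been conditioned on one fixed explanation.
    """
    if not decisions:
        raise ValueError("need at least one sampled decision")
    counts = Counter(decisions)
    total = len(decisions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_decision_entropy(samples_per_explanation):
    """Mean decision entropy over several explanations (an assumed estimator).

    `samples_per_explanation` holds one list of sampled decisions per explanation.
    """
    if not samples_per_explanation:
        raise ValueError("need at least one explanation")
    entropies = [decision_entropy(d) for d in samples_per_explanation]
    return sum(entropies) / len(entropies)
```

Under this reading, an explanation that always leads to the same decision has entropy 0, while one that leaves the decision split evenly between two labels has entropy 1 bit. How such a quantity is turned into a reward (for example, whether low entropy is rewarded only when the decision is correct) is not stated in the abstract.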
