Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster
Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri
TL;DR
This study asks whether explanations can speed up professional hate-speech moderators. The authors compare three conditions: the post alone, generic rule-based explanations, and structured, post-specific explanations built from gold parse-tree annotations in PLEAD. Structured explanations cut per-post decision time by 1.34 seconds (about 7.4%) without reducing accuracy, while generic explanations yield no speed benefit. A follow-up moderator survey reveals a strong preference for structured explanations. These findings suggest that deploying structured explanations in moderation tools can meaningfully boost throughput on large platforms, and they can guide future development of explainable abuse-detection systems.
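To make the two figures above easier to relate, here is a minimal sketch of the arithmetic. It assumes that the 7.4% reduction is measured relative to the mean per-post time in the post-only condition. The baseline time it derives is not stated in this summary and is only an implied, approximate value. The function names are hypothetical.

```python
def implied_baseline_seconds(saved_seconds: float, relative_reduction: float) -> float:
    """Baseline per-post time implied by an absolute and a relative saving.

    Assumption: relative_reduction = saved_seconds / baseline.
    """
    if not 0 < relative_reduction < 1:
        raise ValueError("relative_reduction must be strictly between 0 and 1")
    return saved_seconds / relative_reduction


def throughput_gain(relative_reduction: float) -> float:
    """Fractional increase in posts handled per unit time when the
    time per post drops by relative_reduction."""
    if not 0 <= relative_reduction < 1:
        raise ValueError("relative_reduction must be in [0, 1)")
    return 1.0 / (1.0 - relative_reduction) - 1.0


# Figures quoted in the summary: 1.34 s saved per post, about 7.4%.
baseline = implied_baseline_seconds(1.34, 0.074)
gain = throughput_gain(0.074)

print(f"Implied baseline: {baseline:.1f} s per post")
print(f"Implied throughput gain: {gain:.1%}")
```

Under these assumptions, a 7.4% cut in time per post corresponds to roughly 8% more posts reviewed in the same working time, since throughput scales with the inverse of time per post.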
Abstract
Content moderators play a key role in keeping the conversation on social media healthy. While the high volume of content they need to judge is a bottleneck in the moderation pipeline, no studies have explored how models could help them make faster decisions. There is, by now, a vast body of research into detecting hate speech, sometimes explicitly motivated by a desire to help improve content moderation, but published research using real content moderators is scarce. In this work we investigate the effect of explanations on the speed of real-world moderators. Our experiments show that while generic explanations do not affect their speed and are often ignored, structured explanations lower moderators' decision-making time by 7.4%.
