Latent Hatred: A Benchmark for Understanding Implicit Hate Speech
Mai ElSherief, Caleb Ziems, David Muchlinski, Vaishnavi Anupindi, Jordyn Seybolt, Munmun De Choudhury, Diyi Yang
TL;DR
This work addresses the under-explored problem of implicit hate speech by introducing a six-category taxonomy grounded in social science and a large, richly annotated Twitter benchmark with fine-grained implicit labels and free-text implied statements. It develops a two-stage annotation process (crowdsourced high-level labeling followed by expert fine-grained labeling), expands the corpus to balance minority classes, and labels the target group and implied meaning of each message. The paper shows that transformer-based models (e.g., BERT) outperform traditional baselines at detection, and it reports promising results for generating explanations of implicit hate with GPT-2, pointing to practical applications in content moderation and in making implicit hate easier to understand. It also identifies major challenges in implicit hate detection and outlines future directions for modeling, decoding coded language, and mitigating bias.
Abstract
Hate speech has grown significantly on social media, causing serious consequences for victims of all demographics. Although much attention has been paid to characterizing and detecting discriminatory speech, most work has focused on explicit or overt hate speech, failing to address a more pervasive form based on coded or indirect language. To fill this gap, this work introduces a theoretically justified taxonomy of implicit hate speech and a benchmark corpus with fine-grained labels for each message and its implication. We present systematic analyses of our dataset using contemporary baselines to detect and explain implicit hate speech, and we discuss key features that challenge existing models. This dataset will continue to serve as a useful benchmark for understanding this multifaceted issue.
