Automated Hate Speech Detection and the Problem of Offensive Language

Thomas Davidson, Dana Warmsley, Michael Macy, Ingmar Weber

TL;DR

This work tackles the difficulty of distinguishing hate speech from offensive language on social media by adopting a three-way classification scheme (hate speech, offensive language, neither) and collecting crowd-sourced annotations for tweets containing Hatebase terms. It engineers a rich feature set (TF-IDF n-grams, POS features, readability, sentiment, and meta cues) and trains a one-vs-rest logistic regression classifier, achieving strong overall metrics but revealing substantial challenges in detecting hate speech, especially when explicit keywords are absent or contextual usage differs. The analysis shows that explicit slurs robustly signal hate speech, but many hate cases are missed or misclassified when terms are not used in overtly hateful ways, underscoring the need for context-aware data and models. The findings argue for separating hate speech from offensive language in detection systems and for future work to address domain bias and the heterogeneity of hate speech expressions, with implications for policy and platform moderation.
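The one-vs-rest setup mentioned above can be illustrated with a minimal, standard-library-only sketch. It covers only TF-IDF weighted word n-grams (unigrams to trigrams) and omits the POS, readability, sentiment, and meta features. The toy messages, the placeholder tokens `slurtoken` and `cursetoken`, the smoothed IDF formula, and every hyperparameter (`epochs`, `lr`, `l2`) are assumptions made for illustration, not values taken from the paper.

```python
import math
from collections import Counter

def ngrams(text, n_max=3):
    """Word n-grams from unigrams up to n_max-grams."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc)))
    n = len(docs)
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(doc, idf):
    """Sparse, L2-normalised TF-IDF vector stored as a dict."""
    tf = Counter(t for t in ngrams(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    """Linear score (bias plus weighted features) of one binary model."""
    w, b = model
    return b + sum(w.get(t, 0.0) * v for t, v in x.items())

def train_binary(X, y, epochs=200, lr=0.5, l2=0.01):
    """Binary logistic regression trained with plain SGD and an L2 penalty."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = sigmoid(score((w, b), x)) - target
            for t, v in x.items():
                w[t] = w.get(t, 0.0) - lr * (err * v + l2 * w.get(t, 0.0))
            b -= lr * err
    return w, b

def train_ovr(X, labels):
    """One-vs-rest: one binary model per class, that class against all others."""
    return {c: train_binary(X, [1.0 if l == c else 0.0 for l in labels])
            for c in sorted(set(labels))}

def predict(models, x):
    """Return the class whose binary model gives the highest score."""
    return max(models, key=lambda c: score(models[c], x))

# Hypothetical toy data; placeholder tokens stand in for real lexicon terms.
DOCS = [
    "slurtoken people should leave",
    "those slurtoken are the worst",
    "that game was cursetoken awful",
    "cursetoken traffic again today",
    "lovely weather this morning",
    "reading a good book tonight",
]
LABELS = ["hate", "hate", "offensive", "offensive", "neither", "neither"]

idf = fit_idf(DOCS)
X = [vectorize(d, idf) for d in DOCS]
models = train_ovr(X, LABELS)
predictions = [predict(models, x) for x in X]
print(predictions)
```

In practice a library implementation would normally replace the hand-written training loop; the sketch is only meant to make the one-classifier-per-class structure and the highest-score decision rule concrete.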

Abstract

A key challenge for automatic hate-speech detection on social media is the separation of hate speech from other instances of offensive language. Lexical detection methods tend to have low precision because they classify all messages containing particular terms as hate speech, and previous work using supervised learning has failed to distinguish between the two categories. We use a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords. We then use crowd-sourcing to label a sample of these tweets into three categories: those containing hate speech, those containing only offensive language, and those containing neither. We train a multi-class classifier to distinguish between these categories. Close analysis of the predictions and the errors shows when we can reliably separate hate speech from other offensive language and when this differentiation is more difficult. We find that racist and homophobic tweets are more likely to be classified as hate speech but that sexist tweets are generally classified as offensive. Tweets without explicit hate keywords are also more difficult to classify.
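The low precision of purely lexical detection can be seen in a small sketch. The lexicon terms `term_a` and `term_b`, the five messages, and their labels are all hypothetical placeholders invented for illustration; none of them come from the paper's data.

```python
def lexical_flag(text, lexicon):
    """Flag a message as hate speech if any token is a lexicon term."""
    return any(tok in lexicon for tok in text.lower().split())

def precision(flags, gold, positive="hate"):
    """Share of flagged messages whose gold label is the positive class."""
    flagged = [g for f, g in zip(flags, gold) if f]
    if not flagged:
        return 0.0
    return sum(g == positive for g in flagged) / len(flagged)

# Hypothetical lexicon and labelled messages (placeholders, not real data).
LEXICON = {"term_a", "term_b"}
MESSAGES = [
    ("term_a people do not belong here", "hate"),
    ("my term_a friends are hilarious", "offensive"),
    ("that referee is a term_b", "offensive"),
    ("quoting the word term_b to report it", "neither"),
    ("nice weather today", "neither"),
]

flags = [lexical_flag(text, LEXICON) for text, _ in MESSAGES]
gold = [label for _, label in MESSAGES]
baseline_precision = precision(flags, gold)
print(flags, baseline_precision)
```

In this toy set the keyword rule flags four messages, but only one is labelled hate speech, so three of the four flags are false positives. This is the failure mode that motivates separating hate speech from merely offensive language.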

Paper Structure

This paper contains 7 sections, 1 figure.

Figures (1)

  • Figure 1: True versus predicted categories