Harnessing Artificial Intelligence to Combat Online Hate: Exploring the Challenges and Opportunities of Large Language Models in Hate Speech Detection
Tharindu Kumarage, Amrita Bhattacharjee, Joshua Garland
TL;DR
This work evaluates large language models (LLMs) as hate speech detectors, combining a literature review with an empirical study on HateCheck using open-source and proprietary models (Llama-2, Falcon, GPT-3.5). It demonstrates that GPT-3.5 and Llama-2 achieve strong zero-shot performance (roughly 80–90% accuracy) while identifying key weaknesses such as target-specific errors and reliance on spurious cues in some models. The study further shows that simple, direct prompts often outperform more complex prompting strategies like context or chain-of-thought prompts in hate speech detection. The findings offer practical guidance on model choice, prompt design, and evaluation practices to improve reliability and fairness in automated moderation, while highlighting ongoing challenges in robustness and counter-speech contexts.
Abstract
Large language models (LLMs) excel in many diverse applications beyond language generation, e.g., translation, summarization, and sentiment analysis. One intriguing application is text classification, which becomes especially pertinent in identifying hateful or toxic speech, a domain fraught with challenges and ethical dilemmas. Our study has two objectives. First, we offer a literature review of LLMs as classifiers, emphasizing their role in detecting and classifying hateful or toxic content. Second, we explore the efficacy of several LLMs in classifying hate speech, identifying which LLMs excel at this task as well as their underlying attributes and training, and thereby providing insight into the factors that contribute to an LLM's proficiency (or lack thereof) in discerning hateful content. By combining a comprehensive literature review with an empirical analysis, our paper strives to shed light on the capabilities and constraints of LLMs in the crucial domain of hate speech detection.
