Investigating Annotator Bias in Large Language Models for Hate Speech Detection

Amit Das; Zheng Zhang; Najib Hasan; Souvika Sarkar; Fatemeh Jamshidi; Tathagata Bhattacharya; Mostafa Rahgouy; Nilanjana Raychawdhary; Dongji Feng; Vinija Jain; Aman Chadha; Mary Sandage; Lauramarie Pope; Gerry Dozier; Cheryl Seals

TL;DR

This paper examines the biases that LLMs exhibit when annotating hate speech data and analyzes the annotated data to identify potential factors contributing to those biases.

Abstract

Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process that is susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs) presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs as annotators, this paper examines the biases present in LLMs when they annotate hate speech data. Our research contributes to understanding biases in four key categories (gender, race, religion, and disability) using four LLMs: GPT-3.5, GPT-4o, Llama-3.1, and Gemma-2. We analyze annotator biases by specifically targeting highly vulnerable groups within these categories. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateBiasNet, to conduct this research. Additionally, we perform the same experiments on the ETHOS dataset (Mollas et al. 2022) for comparative analysis. This paper serves as a resource to guide researchers and practitioners in harnessing the potential of LLMs for data annotation, thereby fostering advancements in this critical field.

Paper Structure (20 sections, 6 figures, 4 tables)

Introduction
Related Work
Methodologies
Data Collection and Annotation
Data Annotation by LLMs
Annotator Biases
Results & Discussion
GPT-3.5
GPT-4o
Llama-3.1
Gemma-2
Conclusion
Appendix
HateBiasNet Details
Student Data Annotation Instructions
...and 5 more sections

Figures (6)

Figure 1: Workflow diagram of our study, illustrating how varying biases can lead to different outcomes when annotating a sample text as hateful. We investigate annotator biases across four categories for hate speech detection using the following LLMs: GPT-3.5, GPT-4o, Llama-3.1, and Gemma-2.
Figure 2: Heatmap of the HateBiasNet dataset illustrating the accuracy of 11 biases across 4 LLMs. Notably, GPT-3.5, GPT-4o, and Llama-3.1 demonstrate the highest accuracy for the 'Mental disability' bias. The word cloud of the dataset suggests that specific keywords may influence annotation outcomes for these LLMs. Additionally, Llama-3.1 shows the highest accuracy overall for the 'Mental disability' bias among the 4 models.
Figure 3: Heatmap of the ETHOS dataset depicting the accuracy of 11 biases across 4 LLMs. The bias 'Black' achieved the highest accuracy for both GPT-3.5 and GPT-4o, while 'Asian' exhibited the highest accuracy for Llama-3.1. The word cloud of the dataset suggests that specific keywords may influence annotation results for these LLMs. Notably, GPT-4o and Llama-3.1 consistently outperformed GPT-3.5 and Gemma-2 across all biases. Among the LLMs, GPT-4o's performance on the 'Black' bias stands out as the highest overall accuracy.
Figure 4: Word histogram (considering only the top 5 words) of (a) HateBiasNet and (b) ETHOS after removing the stopwords.
Figure 5: Line graph of the HateBiasNet dataset displaying 11 biases across the 4 LLMs. Notably, for GPT-3.5, GPT-4o, and Llama-3.1, the 'Mental disability' bias achieved the highest accuracy. The word cloud of the dataset suggests that the presence of specific keywords may influence the annotation outcomes for these three LLMs.
...and 1 more figure
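
The heatmaps and line graphs above report one accuracy value per (bias, LLM) pair. As a rough illustration of how such a grid could be tabulated, here is a minimal Python sketch. It assumes each annotation is available as a (model, bias, predicted label, gold label) tuple and that accuracy means the fraction of predictions matching the gold label; the record layout, the function name accuracy_grid, and the toy records are all hypothetical and are not taken from the paper or its datasets.

```python
from collections import defaultdict


def accuracy_grid(records):
    """Tabulate accuracy per (bias, model) pair.

    `records` is an iterable of (model, bias, predicted, gold) tuples.
    This layout is an assumption made for illustration only.
    Returns a nested dict: {bias: {model: accuracy}}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, bias, predicted, gold in records:
        total[(bias, model)] += 1
        if predicted == gold:
            correct[(bias, model)] += 1

    grid = defaultdict(dict)
    for (bias, model), count in total.items():
        grid[bias][model] = correct[(bias, model)] / count
    return {bias: dict(row) for bias, row in grid.items()}


# Made-up toy records, purely illustrative (not data from the paper).
toy_records = [
    ("GPT-4o", "Black", "hateful", "hateful"),
    ("GPT-4o", "Black", "not hateful", "hateful"),
    ("Gemma-2", "Black", "hateful", "hateful"),
    ("Gemma-2", "Asian", "not hateful", "hateful"),
]

grid = accuracy_grid(toy_records)
for bias, row in sorted(grid.items()):
    print(bias, row)
```

Each row of the resulting dictionary corresponds to one bias and each inner key to one LLM, which is the same layout a heatmap of biases against models would use.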
