Hate Personified: Investigating the role of LLMs in content moderation

Sarah Masud; Sahajpreet Singh; Viktor Hangya; Alexander Fraser; Tanmoy Chakraborty

TL;DR

This study investigates how large language models (LLMs) respond to contextual prompts in hate-speech annotation across five languages and six datasets, focusing on geographical cues, annotator personas, and numerical anchoring. Using zero-shot prompting with two models, FlanT5-XXL and GPT-3.5, it shows that geographic cues improve human-LLM alignment, while persona cues can induce variability and numerical anchors can bias outputs. The work emphasizes that LLMs should assist human moderators rather than replace them, and it offers practical guidelines to mitigate bias in multilingual content moderation. The findings highlight the importance of transparency in LLM training and prompting design when deploying AI-assisted moderation in culturally diverse settings.

Abstract

For subjective tasks such as hate detection, where people perceive hate differently, the ability of Large Language Models (LLMs) to represent diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLMs' sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs, five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating geographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances of applying LLMs in culturally sensitive cases.
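As a rough illustration of what "including additional context in prompts" can look like, the sketch below prefixes a post with a geographical cue, following the template quoted in the Figure 1 caption. The function name `build_prompt`, the constant `BASE_QUESTION`, and the wording of the no-context variant are assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import Optional

# Question text as quoted in the Figure 1 caption.
BASE_QUESTION = "Is the given statement hateful?"


def build_prompt(post: str, country: Optional[str] = None) -> str:
    """Return a zero-shot prompt for a post, optionally with a country cue.

    With a country, the wording follows the Figure 1 template.
    Without one, the phrasing is a hypothetical stand-in for the base prompt.
    """
    if country is None:
        # Assumed base form: the paper's exact base wording is not shown here.
        return f"Statement: {post}. {BASE_QUESTION}"
    return f"The following statement was made in {country}: {post}. {BASE_QUESTION}"


if __name__ == "__main__":
    print(build_prompt("<POST>"))
    print(build_prompt("<POST>", country="Singapore"))
```

Comparing model outputs for the two variants on the same post is one simple way to observe how a geographical cue shifts the predicted label.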

Paper Structure (14 sections, 6 figures, 12 tables)

Introduction
Related Work
Experimental Setup
Do LLMs Pick on Geographical Cues?
Can LLMs Mimic Annotator Persona?
Are LLMs Sensitive to Anchoring Bias?
Discussion
Conclusion
Limitations
Ethical Considerations
RQ1 Multilingual Prompting
RQ2 Multilingual Persona Prompting
RQ3 Temperature Probing
Statistical Testing on FlanT5-XXL

Figures (6)

Figure 1: Annotations of hate/non-hate (red/green) for USA, Australia, UK, South Africa, and Singapore, by a) annotators from the respective countries (circle) and b) prompting GPT-3.5 with 'The following statement was made in <country>: <POST>. Is the given statement hateful?' (square). The posts and human labels are verbatim from the CREHate dataset (Lee et al., 2023).
Figure 2: Overview of the research pipeline: an incoming post is prefixed with context to form the prompt for the LLM. The predicted label is then evaluated against ground truth to examine variability arising from context.
Figure 3: [RQ1] (a-b) Inter-annotator agreement (IAA) w.r.t. human annotations for each country, for FlanT5-XXL and GPT-3.5, respectively, on English posts. (c) IAA w.r.t. human labels for each language via GPT-3.5, with posts in that language and prompts in English. Here, without (w/o) is $p_{base}$ and with (w/) is $p_{con}$/$p_{lang}$ + $p_{base}$.
Figure 4: [RQ2] Predicted hate label ratio (PHLR) from GPT-3.5 comparing $p_{trait}^{L_*}$ + $p_{base}$ for Arabic, French, German, and Hindi. (a) and (b) capture the base vs. vulnerable persona for hate/non-hate queries, respectively. (c) and (d) capture the native vs. non-native speaker persona for hate/non-hate queries, respectively.
Figure 5: [RQ3] For $p_{vote}^H$: (a) and (b) capture the IAA among various hateful voting percentages, i.e., $z\%$, for FlanT5-XXL and GPT-3.5, respectively. For $p_{vote}^N$: (c) and (d) function analogously to (a) and (b), respectively.
...and 1 more figure
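The captions above report two quantities: agreement between LLM and human labels (IAA) and the predicted hate label ratio (PHLR). The sketch below shows one way such numbers could be computed from binary labels. It assumes Cohen's kappa as the agreement statistic and reads PHLR as the fraction of predictions labelled hateful; neither definition is confirmed by the text here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def predicted_hate_label_ratio(preds: Sequence[int]) -> float:
    """Fraction of predictions equal to 1 (hate). Assumed reading of PHLR."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(1 for p in preds if p == 1) / len(preds)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa between two label sequences (assumed IAA statistic)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = [1, 1, 0, 0]  # toy labels, not data from the paper
    llm = [1, 0, 0, 0]
    print(cohen_kappa(human, llm))
    print(predicted_hate_label_ratio(llm))
```

Computing these once for the base prompt and once for a context-augmented prompt gives the kind of with/without comparison the captions describe.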
