Hate Personified: Investigating the role of LLMs in content moderation

Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty

TL;DR

This study investigates how large language models (LLMs) respond to contextual prompts in hate-speech annotation across five languages and six datasets, focusing on geographical cues, annotator personas, and numerical anchoring. Using zero-shot prompting with two models, FlanT5-XXL and GPT-3.5, it shows that geographic cues improve human-LLM alignment, while persona cues can induce variability and numerical anchors can bias outputs. The work emphasizes that LLMs should assist human moderators rather than replace them, and it offers practical guidelines to mitigate bias in multilingual content moderation. The findings highlight the importance of transparency in LLM training and prompting design when deploying AI-assisted moderation in culturally diverse settings.

Abstract

For subjective tasks such as hate detection, where people perceive hate differently, the ability of Large Language Models (LLMs) to represent diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLMs' sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs, five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating geographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances of applying LLMs in culturally sensitive cases.

Paper Structure (14 sections, 6 figures, 12 tables)

Figures (6)

  • Figure 1: Annotations of hate/non-hate (red/green) for USA, Australia, UK, South Africa, and Singapore, by a) annotators from respective countries (circle) and b) prompting GPT-3.5 with 'The following statement was made in <country>: <POST>. Is the given statement hateful?' (square). The posts and human labels are verbatim from the CREHate dataset (Lee et al., 2023).
  • Figure 2: Overview of the research pipeline: an incoming post is prefixed with context to form the prompt for the LLM. The predicted label is then evaluated against ground truth to examine variability arising from context.
  • Figure 3: [RQ1] (a-b) The inter-annotator agreement (IAA) w.r.t. human annotation for each country, for FlanT5-XXL and GPT-3.5, respectively, on English posts. (c) Each language's IAA w.r.t. human labels via GPT-3.5, with posts in the language and prompts in English. Here, without (w/o) is $p_{base}$ and with (w/) is $p_{con}$/$p_{lang}$ + $p_{base}$.
  • Figure 4: [RQ2] Predicted hate label ratio (PHLR) from GPT-3.5 comparing $p_{trait}^{L_*}$ + $p_{base}$ for Arabic, French, German, and Hindi. (a) and (b) capture the base vs. vulnerable persona for hate/non-hate queries, respectively. (c) and (d) capture the native vs. non-native speaker persona for hate/non-hate queries, respectively.
  • Figure 5: [RQ3] For $p_{vote}^H$: (a) and (b) capture the IAA among various hateful voting percentages, i.e., $z\%$, for FlanT5-XXL and GPT-3.5, respectively. For $p_{vote}^N$: (c) and (d) function analogously to (a) and (b), respectively.
  • ...and 1 more figure
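
The geographic prompt quoted in the Figure 1 caption and the predicted hate label ratio (PHLR) named in the Figure 4 caption can be illustrated with a minimal Python sketch. The template string is taken verbatim from the caption. Everything else is an assumption made for illustration: the function names, the label strings `"hate"` and `"non-hate"`, and the reading of PHLR as the fraction of predictions labeled hate. The captions do not say how IAA is computed, so the sketch does not attempt it.

```python
from typing import Sequence

# Template quoted verbatim from the Figure 1 caption; only the placeholder
# syntax is adapted to Python's str.format.
GEO_TEMPLATE = (
    "The following statement was made in {country}: {post}. "
    "Is the given statement hateful?"
)


def build_geo_prompt(post: str, country: str) -> str:
    """Prefix a post with a geographic cue (hypothetical helper name)."""
    return GEO_TEMPLATE.format(country=country, post=post)


def predicted_hate_label_ratio(labels: Sequence[str], hate_label: str = "hate") -> float:
    """Fraction of predictions equal to the hate label.

    Assumption: this is one plausible reading of PHLR, not a definition
    confirmed by the captions above.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == hate_label) / len(labels)


if __name__ == "__main__":
    print(build_geo_prompt("Example post", "Singapore"))
    # Hypothetical model outputs, for illustration only.
    print(predicted_hate_label_ratio(["hate", "non-hate", "hate", "non-hate"]))
```

Keeping prompt construction separate from scoring means the same ratio function can be reused for the persona and voting prompts once their wording is known.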