Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Rebecca Dorn, Lee Kezar, Fred Morstatter, Kristina Lerman

TL;DR

This work addresses biases in harmful speech detection against gender-queer dialects by introducing QueerReclaimLex, a dataset of 109 templates featuring reclaimed LGBTQ+ slurs annotated by gender-queer individuals. It systematically evaluates five off-the-shelf models and three prompting schemas (vanilla, identity, identity-cot) to determine how author identity context affects harm judgments, revealing a consistent ingroup bias and low performance for ingroup posts (e.g., $F1$ scores often well below 0.5). The study finds that even with identity context and chain-of-thought prompting, large language models struggle to correctly interpret reclaimed slurs, suggesting the need for fairer moderation that moves beyond keyword reliance and incorporates community input. Overall, the work highlights significant fairness gaps in current moderation systems and proposes directions for dataset expansion, model alignment, and inclusive evaluation to foster more equitable online spaces for gender-queer users.

Abstract

Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly the aggressive flagging of posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting to teach large language models (LLMs) to leverage author identity context. We reveal a tendency for these models to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs, performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 ≤ 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.
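
The abstract reports F1 separately for posts written by members of the targeted group. As a rough illustration of that kind of per-group comparison, the sketch below computes binary F1 for ingroup and outgroup posts. It is not the authors' evaluation code: the record layout, the choice of "harmful" as the positive class, and the toy labels are all assumptions made for the example.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (assumed here: harmful = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_group(records):
    """Group (group, gold, pred) triples by author group and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for group, gold, pred in records:
        grouped[group][0].append(gold)
        grouped[group][1].append(pred)
    return {g: f1_score(golds, preds) for g, (golds, preds) in grouped.items()}


# Hypothetical toy data, not drawn from QueerReclaimLex.
# Each triple is (author group, annotator label, model label); 1 = harmful.
records = [
    ("ingroup", 0, 1),   # reclaimed use flagged as harmful (false positive)
    ("ingroup", 0, 1),
    ("ingroup", 1, 1),
    ("ingroup", 0, 0),
    ("outgroup", 1, 1),
    ("outgroup", 1, 1),
    ("outgroup", 0, 0),
    ("outgroup", 1, 0),
]

scores = f1_by_group(records)
for group, score in sorted(scores.items()):
    print(f"{group}: F1 = {score:.2f}")
```

In this toy data, false positives on ingroup posts lower precision and therefore F1 for that group. That is the pattern the abstract describes, though the real evaluation may threshold or aggregate scores differently.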

Paper Structure (28 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The three prompting schemas (vanilla, identity, and identity-cot) used to elicit toxicity scores from our models. Each schema introduces an additional aspect of context to the model. Bold fields include examples.
  • Figure 2: Illustrative example of how the terms ingroup and outgroup are used in the scope of this paper.
  • Figure 3: Examples of how tweets from gender-queer authors become templates, and how those templates translate to instances of QueerReclaimLex. The original reclaimed slurs are in purple, positions for slurs are in green and inserted slurs are in blue.
  • Figure 4: Blue text contains the identity prompting schema formatted for LLaMA 2. The post shown is an example. When the instance contains the purple text after the blue, the prompt becomes a version of our identity-cot prompting schema with only one example rather than four.
  • Figure 5: Frequency of SLUR USAGE depending on expert-obtained harm scores. Ingroup posts are far less likely to be harmful.
  • ...and 2 more figures

Theorems & Definitions (1)

  • definition 1