Harmful Speech Detection by Language Models Exhibits Gender-Queer Dialect Bias

Rebecca Dorn; Lee Kezar; Fred Morstatter; Kristina Lerman

TL;DR

This work addresses bias in harmful speech detection against gender-queer dialects by introducing QueerReclaimLex, a dataset built from 109 templates featuring reclaimed LGBTQ+ slurs and annotated by gender-queer individuals. It systematically evaluates five off-the-shelf models and three prompting schemas (vanilla, identity, identity-cot) to determine how author identity context affects harm judgments. The evaluation reveals a consistent bias against ingroup authors: models perform worst on posts that appear to be written by members of the group a slur targets (F1 ≤ 0.24 across all LLMs). Even with identity context and chain-of-thought prompting, large language models struggle to interpret reclaimed slurs correctly, which suggests that fairer moderation must move beyond keyword reliance and incorporate community input. Overall, the work highlights significant fairness gaps in current moderation systems and proposes directions for dataset expansion, model alignment, and inclusive evaluation to foster more equitable online spaces for gender-queer users.

Abstract

Content moderation on social media platforms shapes the dynamics of online discourse, influencing whose voices are amplified and whose are suppressed. Recent studies have raised concerns about the fairness of content moderation practices, particularly the aggressive flagging of posts from transgender and non-binary individuals as toxic. In this study, we investigate the presence of bias in harmful speech classification of gender-queer dialect online, focusing specifically on the treatment of reclaimed slurs. We introduce a novel dataset, QueerReclaimLex, based on 109 curated templates exemplifying non-derogatory uses of LGBTQ+ slurs. Dataset instances are scored by gender-queer annotators for potential harm depending on additional context about speaker identity. We systematically evaluate the performance of five off-the-shelf language models in assessing the harm of these texts and explore the effectiveness of chain-of-thought prompting to teach large language models (LLMs) to leverage author identity context. We reveal a tendency for these models to inaccurately flag texts authored by gender-queer individuals as harmful. Strikingly, across all LLMs, performance is poorest for texts that show signs of being written by individuals targeted by the featured slur (F1 ≤ 0.24). We highlight an urgent need for fairness and inclusivity in content moderation systems. By uncovering these biases, this work aims to inform the development of more equitable content moderation practices and contribute to the creation of inclusive online spaces for all users.
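To make the per-group comparison concrete, the following is a minimal sketch of how F1 could be computed separately for ingroup and outgroup posts. It assumes binary labels with "harmful" (1) as the positive class; the record fields `group`, `label`, and `pred`, and the toy data, are hypothetical and are not taken from QueerReclaimLex or from the paper's evaluation code.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_group(records):
    """Split records by their (hypothetical) 'group' field and score each split separately."""
    labels = defaultdict(list)
    preds = defaultdict(list)
    for record in records:
        labels[record["group"]].append(record["label"])
        preds[record["group"]].append(record["pred"])
    return {group: f1_score(labels[group], preds[group]) for group in labels}


# Toy data only: a classifier that over-flags ingroup posts as harmful.
toy_records = [
    {"group": "ingroup", "label": 0, "pred": 1},
    {"group": "ingroup", "label": 0, "pred": 1},
    {"group": "ingroup", "label": 0, "pred": 0},
    {"group": "ingroup", "label": 1, "pred": 1},
    {"group": "outgroup", "label": 1, "pred": 1},
    {"group": "outgroup", "label": 1, "pred": 1},
    {"group": "outgroup", "label": 0, "pred": 0},
    {"group": "outgroup", "label": 1, "pred": 1},
]

scores = f1_by_group(toy_records)
```

In this toy setup, false positives on ingroup posts pull precision down and with it the ingroup F1, which is the pattern of over-flagging the abstract describes. Whether the paper reports binary, macro, or weighted F1 is not stated here, so the averaging choice above is an assumption.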

Paper Structure (28 sections, 1 equation, 7 figures, 3 tables)

Introduction
Related Work
Trans and Non-Binary Dialects
Linguistics of Slurs
Gender Variance in NLP
Defining and Detecting Harmful Speech
Methods
QueerReclaimLex Dataset
Template Creation
Gender-queer Slurs
Annotator Recruitment & Demographics
Annotation Fields
Harm Classification
Toxicity Classifier Selection
Large Language Model Selection
...and 13 more sections

Figures (7)

Figure 1: The three prompting schemas (vanilla, identity, and identity-cot) used to elicit toxicity scores from our models. Each schema introduces an additional aspect of context to the model. Bold fields include examples.
Figure 2: Illustrative example of how the terms ingroup and outgroup are used in the scope of this paper.
Figure 3: Examples of how tweets from gender-queer authors become templates, and how those templates translate to instances of QueerReclaimLex. The original reclaimed slurs are in purple, positions for slurs are in green and inserted slurs are in blue.
Figure 4: The blue text shows the identity prompting schema formatted for LLaMA 2, with an example post. When the purple text follows the blue, the prompt becomes a version of our identity-cot prompting schema with one example rather than four.
Figure 5: Frequency of SLUR USAGE depending on expert-obtained harm scores. Ingroup posts are far less likely to be harmful.
...and 2 more figures
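The captions for Figures 1 and 4 describe three schemas that add context cumulatively: the post alone, the post plus author identity context, and the same plus worked chain-of-thought examples. A minimal sketch of that layering follows. The function name `build_prompt`, the field labels ("Post", "Author context", "Reasoning", "Harmful"), and the example dictionary keys are all hypothetical; the paper's actual prompt wording is not reproduced here.

```python
SCHEMAS = ("vanilla", "identity", "identity-cot")


def build_prompt(post, schema="vanilla", author_context=None, cot_examples=()):
    """Assemble a prompt string; the layout is illustrative, not the paper's template."""
    if schema not in SCHEMAS:
        raise ValueError(f"unknown schema: {schema!r}")
    uses_identity = schema in ("identity", "identity-cot")
    if uses_identity and author_context is None:
        raise ValueError("identity schemas need author_context")

    blocks = []
    if schema == "identity-cot":
        # Worked examples come first; Figure 4's caption mentions four of them.
        for example in cot_examples:
            blocks.append(
                f"Post: {example['post']}\n"
                f"Author context: {example['author_context']}\n"
                f"Reasoning: {example['reasoning']}\n"
                f"Harmful: {example['answer']}"
            )

    query = f"Post: {post}"
    if uses_identity:
        query += f"\nAuthor context: {author_context}"
    # The chain-of-thought schema asks for reasoning before the final judgment.
    query += "\nReasoning:" if schema == "identity-cot" else "\nHarmful:"
    blocks.append(query)
    return "\n\n".join(blocks)


# Placeholder content only; no real dataset text is used.
example = {
    "post": "<example post>",
    "author_context": "<example author context>",
    "reasoning": "<example reasoning>",
    "answer": "no",
}
vanilla_prompt = build_prompt("<post text>")
identity_prompt = build_prompt("<post text>", "identity", "<author context>")
cot_prompt = build_prompt("<post text>", "identity-cot", "<author context>", [example])
```

Keeping the three schemas in one function makes the cumulative design visible: each schema differs from the previous one only by the extra context it prepends or appends.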

Theorems & Definitions (1)

Definition 1
