Who Speaks Matters: Analysing the Influence of the Speaker's Ethnicity on Hate Classification

Ananya Malik, Kartik Sharma, Shaily Bhatt, Lynnette Hui Xian Ng

TL;DR

This paper investigates how speaker ethnicity markers, both explicit and dialect-based implicit cues, influence hate speech classification by LLMs. It uses two datasets (MPBHSD and HateXplain) and four models to quantify output flips under marker injections, applying ANOVA and McNemar tests to identify drivers of instability. The key finding is that implicit dialect markers trigger more flips than explicit markers, with flip rates varying by ethnicity and model size; overall, larger models show more robustness. The work highlights risks in deploying LLMs for high-stakes moderation across diverse linguistic communities and suggests a need for robust evaluation and mitigation strategies.
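As a rough illustration of the McNemar test mentioned above, the sketch below computes an exact two-sided p-value from the two discordant counts of a paired comparison. How the paper pairs its predictions is not stated in this summary, so the pairing used here (a model's label on the original sentence versus its label on the marker-injected sentence) is an assumption, and the function name `mcnemar_exact` and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b: pairs that changed in one direction (assumed here: NH -> H)
    c: pairs that changed in the other direction (assumed here: H -> NH)
    Under the null hypothesis, each discordant pair is equally likely
    to fall in either direction, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of asymmetry
    k = min(b, c)
    # Probability of a count at most k, doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up counts for illustration only, not results from the paper.
p_value = mcnemar_exact(b=0, c=5)
print(f"p = {p_value:.4f}")
```

The test looks only at discordant pairs, so it asks whether flips are asymmetric between the two directions, not whether flips occur at all.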

Abstract

Large Language Models (LLMs) offer a lucrative promise for scalable content moderation, including hate speech detection. However, they are also known to be brittle and biased against marginalised communities and dialects, so their application to high-stakes tasks like hate speech detection must be critically scrutinised. In this work, we investigate the robustness of LLM-based hate speech classification, particularly when explicit and implicit markers of the speaker's ethnicity are injected into the input. For explicit markers, we inject a phrase that mentions the speaker's linguistic identity. For implicit markers, we inject dialectal features. By analysing how frequently model outputs flip in the presence of these markers, we reveal varying degrees of brittleness across 3 LLMs, 1 LM, and 5 linguistic identities. We find that implicit dialect markers cause model outputs to flip more often than explicit markers do. Further, the percentage of flips varies across ethnicities. Finally, we find that larger models are more robust. Our findings indicate the need for caution in deploying LLMs for high-stakes tasks like hate speech detection.
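The flip analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `flip_rates`, the label strings `"NH"` and `"H"`, and the toy predictions are all hypothetical, and the choice to normalise each direction by the number of originally non-hateful or originally hateful inputs is an assumption.

```python
def flip_rates(original: list[str], perturbed: list[str]) -> dict[str, float]:
    """Percentage of predictions that flip after a marker is injected.

    original:  model labels on the unmodified sentences ("NH" or "H")
    perturbed: model labels on the same sentences with a marker injected
    """
    if len(original) != len(perturbed):
        raise ValueError("prediction lists must have the same length")

    pairs = list(zip(original, perturbed))
    n_nh = sum(1 for o, _ in pairs if o == "NH")
    n_h = sum(1 for o, _ in pairs if o == "H")
    nh_to_h = sum(1 for o, p in pairs if o == "NH" and p == "H")
    h_to_nh = sum(1 for o, p in pairs if o == "H" and p == "NH")

    def pct(count: int, total: int) -> float:
        # Guard against empty groups instead of dividing by zero.
        return 100.0 * count / total if total else 0.0

    return {
        "overall": pct(nh_to_h + h_to_nh, len(pairs)),
        "NH->H": pct(nh_to_h, n_nh),  # assumed denominator: originally NH
        "H->NH": pct(h_to_nh, n_h),   # assumed denominator: originally H
    }


# Toy labels for illustration only, not data from the paper.
original = ["NH", "NH", "H", "H", "NH"]
perturbed = ["H", "NH", "NH", "H", "NH"]
rates = flip_rates(original, perturbed)
print(rates)
```

Reporting the two directions separately matters for moderation, since a flip from non-hateful to hateful and a flip from hateful to non-hateful carry different costs.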

Paper Structure

This paper contains 24 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: We investigate whether adding the identity of the speaker, whether Singaporean or Jamaican, can affect the model's hate speech classification on the same sentence. Our findings indicate that model outputs do flip because of the presence of such markers, and the percentage of flips depends on the marker, model size, and the ethnicity injected.
  • Figure 2: Percentage of flips in the prediction of different models when the original prediction is non-hateful (NH) or hateful (H) and the sentences are injected with different racial markers of the speaker, either explicitly or implicitly. Flips from non-hateful to hateful (NH->H) correspond to the False Positive Rate (FPR), and flips from hateful to non-hateful (H->NH) correspond to the False Negative Rate (FNR).
  • Figure 3: Percentage of flips across each race against each target group for implicitly marked models.
  • Figure 4: Prompt for Dialect Generation
  • Figure 5: Percentage of flips on the MPBHSD Dataset in the prediction of different models when the original prediction is non-hateful (NH) or hateful (H) and the sentences are injected with different racial markers of the speaker, either explicitly or implicitly.
  • ...and 1 more figures