Systematic Offensive Stereotyping (SOS) Bias in Language Models
Fatma Elsafoury
TL;DR
This work tackles systematic offensive stereotyping (SOS) bias in language models by introducing an MLM-based $SOS_{LM}$ metric and a dedicated profane/non-profane sentence-pair dataset to quantify bias across six sensitive attributes. It validates the metric against established social-bias signals and online-hate data, and examines whether state-of-the-art debiasing (SentDebias) can remove SOS bias. The results show that all examined LMs exhibit SOS bias, with stronger effects for marginalized groups and meaningful links to online hate; debiasing yields mixed improvements across attributes and metrics. The study also assesses how SOS bias affects hate-speech detection fairness, finding evidence of fairness impact but no consistent performance degradation or improvement, underscoring nuanced interactions between bias removal, fairness, and downstream tasks. The authors provide the dataset and code publicly to foster further research in mitigating SOS bias in LMs.
Abstract
In this paper, we propose a new metric to measure systematic offensive stereotyping (SOS) bias in language models (LMs). We then validate the SOS bias and investigate the effectiveness of removing it. Finally, we investigate the impact of the SOS bias in LMs on their performance and fairness in hate speech detection. Our results suggest that all the inspected LMs are SOS biased and that the SOS bias reflects the online hate experienced by marginalized identities. The results also indicate that debiasing methods from the literature worsen the SOS bias in LMs for some sensitive attributes and improve it for others. Our results further suggest that the SOS bias in the inspected LMs affects their fairness in hate speech detection. However, there is no strong evidence that the SOS bias affects their hate speech detection performance.
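To make the idea of a pairwise, MLM-based bias score concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not spell out the formula, so the sketch assumes a pairwise-preference scheme: each profane/non-profane sentence pair has already been scored by a masked LM (for example, with a pseudo-log-likelihood), and the score is the percentage of pairs in which the profane sentence receives the higher score. The function name `sos_lm_score` and the input layout are hypothetical, and the numbers in the usage lines are placeholders rather than results.

```python
from typing import Sequence, Tuple


def sos_lm_score(pair_scores: Sequence[Tuple[float, float]]) -> float:
    """Percentage of sentence pairs in which the profane variant scores higher.

    ASSUMPTION: each tuple is (profane_score, non_profane_score), where a
    score is a log-likelihood-style value that an MLM assigned to the
    sentence. How those scores are produced is outside this sketch.
    """
    if not pair_scores:
        raise ValueError("pair_scores must contain at least one pair")

    # Count pairs where the model prefers the profane sentence.
    # Ties are not counted as a preference for the profane sentence.
    preferred = sum(1 for profane, non_profane in pair_scores if profane > non_profane)
    return 100.0 * preferred / len(pair_scores)


# Placeholder scores for four hypothetical sentence pairs.
example_pairs = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -6.0), (-9.0, -9.0)]
example_score = sos_lm_score(example_pairs)
```

Under this assumed definition, a score near 50 would mean the model shows no systematic preference between the two variants, while a higher score would mean it more often favors the profane sentence for the identity group in question.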
