Systematic Offensive Stereotyping (SOS) Bias in Language Models

Fatma Elsafoury

TL;DR

This work tackles systematic offensive stereotyping (SOS) bias in language models by introducing an MLM-based $SOS_{LM}$ metric and a dedicated profane/non-profane sentence-pair dataset to quantify bias across six sensitive attributes. It validates the metric against established social-bias signals and online-hate data, and examines whether state-of-the-art debiasing (SentDebias) can remove SOS bias. The results show that all examined LMs exhibit SOS bias, with stronger effects for marginalized groups and meaningful links to online hate; debiasing yields mixed improvements across attributes and metrics. The study also assesses how SOS bias affects hate-speech detection fairness, finding evidence of fairness impact but no consistent performance degradation or improvement, underscoring nuanced interactions between bias removal, fairness, and downstream tasks. The authors provide the dataset and code publicly to foster further research in mitigating SOS bias in LMs.

Abstract

In this paper, we propose a new metric to measure the SOS bias in language models (LMs). We then validate the SOS bias and investigate the effectiveness of removing it. Finally, we investigate the impact of the SOS bias in LMs on their performance and fairness in hate speech detection. Our results suggest that all the inspected LMs are SOS biased and that the SOS bias reflects the online hate experienced by marginalized identities. The results also indicate that debiasing methods from the literature worsen the SOS bias in LMs for some sensitive attributes and improve it for others. Finally, our results suggest that the SOS bias in the inspected LMs affects their fairness in hate speech detection. However, there is no strong evidence that the SOS bias affects their hate speech detection performance.

Paper Structure (15 sections, 6 equations, 4 figures, 11 tables)

Introduction
Background
Measure SOS bias in LMs
$SOS_{LM}$ bias dataset
$SOS_{LM}$ bias metric
SOS biased LMs
SOS bias validation
SOS bias vs. social bias in LMs
SOS bias and online hate
SOS bias removal
Impact of SOS bias on the performance of hate speech detection
Impact of SOS bias on fairness of hate speech detection
Limitations
Conclusion
Appendix

Figures (4)

Figure 1: $SOS_{LM}$ bias scores in the different LMs against all identity groups (marginalized and non-marginalized).
Figure 2: Heatmap of the Pearson's correlation ($\rho$) between the $SOS_{LM}$ bias and social bias scores.
Figure 3: Heatmap of the Pearson's correlation ($\rho$) between the SOS bias scores measured using the $SOS_{LM}$ metric and the percentages of marginalized identities who experience online hate in different countries.
Figure 4: Heatmap of Pearson's correlation between social and SOS bias in LMs and fairness scores of LMs on the downstream task of hate speech detection.
