ID-XCB: Data-independent Debiasing for Fair and Accurate Transformer-based Cyberbullying Detection

Peiling Yi; Arkaitz Zubiaga

TL;DR

The paper tackles swear-word-driven bias in transformer-based cyberbullying detection by proposing ID-XCB, a data-independent debiasing framework that combines adversarial training, independent fairness constraints, and debias fine-tuning on contextual transformer embeddings. A hidden-states selector guides layer-wise fine-tuning to improve generalisation to unseen data, while an independent validation set determines the fairness constraints using metrics such as $FPED$ and $FNED$. The embedding loss is defined as $EmbeddingLoss = 1 - \cos(x_1, x_2)$. Empirical results on the Instagram and Vine session-based datasets show that ID-XCB achieves competitive task performance and superior or comparable bias mitigation relative to state-of-the-art data-dependent debiasing methods, with robust cross-dataset generalisation supported by ablation studies. The work provides a practical pathway to fair, accurate cyberbullying detection and contributes to the broader goal of bias-robust NLP systems across unseen data and platforms.
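The quantities named above can be illustrated with a minimal sketch. The embedding loss follows the formula given in the summary. This page does not define $FPED$ and $FNED$, so the sketch assumes the commonly used equality-difference definition: the sum over groups of the absolute gap between the overall error rate and the per-group error rate. The function names, the grouping by presence of swear words, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def embedding_loss(x1, x2):
    """EmbeddingLoss = 1 - cos(x1, x2) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return 1.0 - dot / (norm1 * norm2)


def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate); label 1 = cyberbullying."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def equality_differences(y_true, y_pred, groups):
    """Assumed definition: FPED = sum_g |FPR - FPR_g|, FNED = sum_g |FNR - FNR_g|.

    `groups` maps a (hypothetical) group name to the sample indices in that group.
    """
    fpr, fnr = error_rates(y_true, y_pred)
    fped = 0.0
    fned = 0.0
    for indices in groups.values():
        group_true = [y_true[i] for i in indices]
        group_pred = [y_pred[i] for i in indices]
        group_fpr, group_fnr = error_rates(group_true, group_pred)
        fped += abs(fpr - group_fpr)
        fned += abs(fnr - group_fnr)
    return fped, fned


# Toy data: sessions 0 and 2 contain swear words, sessions 1 and 3 do not.
y_true = [0, 0, 1, 1]
y_pred = [1, 0, 1, 0]
groups = {"swear": [0, 2], "no_swear": [1, 3]}
fped, fned = equality_differences(y_true, y_pred, groups)
```

In this toy case the classifier flags every session containing a swear word and none of the others, so both differences are large. Values closer to zero would indicate that error rates are more evenly distributed across the groups.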

Abstract

Swear words are a common proxy used to collect datasets containing cyberbullying incidents. Our focus is on measuring and mitigating the biases that such data collection strategies introduce through spurious associations between swear words and incidents. After demonstrating and quantifying these biases, we introduce ID-XCB, the first data-independent debiasing technique that combines adversarial training, bias constraints and a debias fine-tuning approach aimed at reducing model attention to bias-inducing words without impacting overall model performance. We evaluate ID-XCB on two popular session-based cyberbullying datasets, along with comprehensive ablation and generalisation studies. We show that ID-XCB learns robust cyberbullying detection capabilities while mitigating biases, outperforming state-of-the-art debiasing methods in both performance and bias mitigation. Our quantitative and qualitative analyses demonstrate its generalisability to unseen data.

Paper Structure (27 sections, 6 equations, 7 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Datasets and Lexicon
Swear Word Bias
Distribution of swear words
Measuring swear word bias
Quantifying bias with transformers
ID-XCB
Training loss functions
Adversarial training
Task training
Fairness constraints
Constraint-based classifier
Joint training
Experiment Settings
...and 12 more sections

Figures (7)

Figure 1: A snippet of a false positive sample from Instagram.
Figure 2: Data-dependent bias constraints vs. ID-XCB.
Figure 3: Architecture of ID-XCB.
Figure 4: Swear word bias in RoBERTa vs ID-XCB$_{RoBERTa}$ for cross-platform experiments.
Figure 5: Impact of constraint weighting. X axis for constraint weights ($\beta$) and Y axis for F1 score.
...and 2 more figures
