Hate Speech Detection with Generalizable Target-aware Fairness

Tong Chen; Danny Wang; Xurong Liang; Marten Risius; Gianluca Demartini; Hongzhi Yin

TL;DR

This work tackles target-aware fairness in hate speech detection by addressing biases toward discussed identity groups, especially for unseen targets at inference. It introduces GetFair, a plug-in framework that replaces fixed target filters with a hypernetwork-generated, low-rank, target-specific filtering mechanism that operates on post embeddings to reduce reliance on target-related signals. GetFair employs adversarial training with a target discriminator and imitation learning, along with semantic gap alignment to regularize filter parameters, enabling generalization to new targets without retraining. Empirical results on two public datasets show GetFair achieves strong accuracy and AUC while significantly improving fairness across targets, demonstrating practical applicability for real-time, fair content moderation. The approach advances fairness in HSD by delivering scalable generalization and robust performance against unseen target groups.

Abstract

To counter the side effect brought by the proliferation of social media platforms, hate speech detection (HSD) plays a vital role in halting the dissemination of toxic online posts at an early stage. However, given the ubiquitous topical communities on social media, a trained HSD classifier easily becomes biased towards specific targeted groups (e.g., female and black people), where a high rate of false positive/negative results can significantly impair public trust in the fairness of content moderation mechanisms, and eventually harm the diversity of online society. Although existing fairness-aware HSD methods can smooth out some discrepancies across targeted groups, they are mostly specific to a narrow selection of targets that are assumed to be known and fixed. This inevitably prevents those methods from generalizing to real-world use cases where new targeted groups constantly emerge over time. To tackle this defect, we propose Generalizable target-aware Fairness (GetFair), a new method for fairly classifying each post that contains diverse and even unseen targets during inference. To remove the HSD classifier's spurious dependence on target-related features, GetFair trains a series of filter functions in an adversarial pipeline, so as to deceive the discriminator that recovers the targeted group from filtered post embeddings. To maintain scalability and generalizability, we innovatively parameterize all filter functions via a hypernetwork that is regularized by the semantic affinity among targets. Taking a target's pretrained word embedding as input, the hypernetwork generates the weights used by each target-specific filter on-the-fly without storing dedicated filter parameters. Finally, comparative experiments on two HSD datasets have shown advantageous performance of GetFair on out-of-sample targets.
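
The filter-generation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear hypernetwork, the low-rank residual form of the filter, the dimensions, and every name (`W_hyper`, `generate_filter`, `apply_filter`) are assumptions made for illustration, and the adversarial training against the target discriminator is omitted. The sketch only shows how a filter's weights can be produced from a target's word embedding on the fly, so that no per-target parameters need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8    # size of a target's pretrained word embedding (assumed)
POST_DIM = 16  # size of a post embedding (assumed)
RANK = 2       # rank of each generated filter (assumed)

# Hypothetical hypernetwork: a single linear map from a target embedding to the
# flattened low-rank factors U (POST_DIM x RANK) and V (RANK x POST_DIM).
W_hyper = rng.normal(scale=0.1, size=(EMB_DIM, 2 * POST_DIM * RANK))


def generate_filter(target_emb):
    """Produce the low-rank factors (U, V) for one target on the fly."""
    flat = target_emb @ W_hyper
    u = flat[: POST_DIM * RANK].reshape(POST_DIM, RANK)
    v = flat[POST_DIM * RANK:].reshape(RANK, POST_DIM)
    return u, v


def apply_filter(post_emb, target_emb):
    """Filter a post embedding with the target-specific low-rank map."""
    u, v = generate_filter(target_emb)
    # Residual form (an assumption): the filter adjusts the embedding
    # instead of replacing it outright.
    return post_emb + (post_emb @ u) @ v


post = rng.normal(size=POST_DIM)
# Stand-in for the word embedding of a target never seen during training.
unseen_target = rng.normal(size=EMB_DIM)
filtered = apply_filter(post, unseen_target)
```

Because the only learned parameters in this sketch live in `W_hyper`, any target with a word embedding can be passed to `apply_filter`, which is the property that would let such a design extend to targets absent from training.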

Paper Structure (21 sections, 12 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Preliminaries
GetFair: Model Design
Target-specific Filter Generation with Adaptive Hypernetwork
Regularizing Filter Parameters with Semantic Gap Alignment
Target Discriminator
Hate Speech Classifier
Adversarial Optimization via Alternation
Experiments
Evaluation Datasets
Baselines and Metrics
Overall Performance (RQ1)
Ablation Study (RQ2)
Hyperparameter Sensitivity (RQ3)
Compatibility with Other Encoders (RQ4)
...and 6 more sections

Figures (4)

Figure 1: An overarching view of GetFair. Detailed designs of the four objective functions can be found in Sections \ref{sec:L_reg} ($\mathcal{L}_{reg}$), \ref{sec:L_dis} ($\mathcal{L}_{dis}$), and \ref{sec:L_hate_imi} ($\mathcal{L}_{hate}$ and $\mathcal{L}_{imi}$), respectively.
Figure 2: Overall performance visualization with both effectiveness and fairness considered. For each dataset, the mean performance is the average of both settings. The size of each scattered point is proportional to $\frac{\text{mean F1}}{\text{mean HF}}$.
Figure 3: The t-SNE (van der Maaten and Hinton, 2008) visualization of target indicators and the generated target-specific filter parameters. Targets that are only seen in the test set are annotated in red.
Figure 4: Analysis of the impact from key hyperparameters, with effectiveness and fairness metrics F1 and HF, respectively.
