Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection

Brage Eilertsen; Røskva Bjørgfinsdóttir; Francielle Vargas; Ali Ramezani-Kebrya

TL;DR

This work addresses the opacity of deep hate speech detectors with Supervised Rational Attention (SRA), a framework that aligns transformer attention with human-provided rationales through a joint classification and attention-alignment objective. By combining a dedicated Attention Alignment Loss with dataset-specific rationale extraction, SRA yields token-level explanations that are more faithful and human-aligned while preserving competitive accuracy and fairness on English and Portuguese benchmarks. The method shows substantial explainability gains (improvements in IoU F1 and Token F1) and robust cross-lingual performance, with a trade-off among fairness metrics. SRA thus provides intrinsic explanations while maintaining strong detection performance, which matters for deployable, fair, and interpretable hate speech systems.

Abstract

The opaque nature of deep learning models presents significant challenges for the ethical deployment of hate speech detection systems. To address this limitation, we introduce Supervised Rational Attention (SRA), a framework that explicitly aligns model attention with human rationales, improving both interpretability and fairness in hate speech classification. SRA integrates a supervised attention mechanism into transformer-based classifiers and optimizes a joint objective that combines the standard classification loss with an alignment loss that minimizes the discrepancy between attention weights and human-annotated rationales. We evaluate SRA on hate speech benchmarks with rationale annotations in English (HateXplain) and Portuguese (HateBRXplain). Empirically, SRA achieves 2.4x better explainability than current baselines and produces token-level explanations that are more faithful and human-aligned. In terms of fairness, SRA is competitive across all measures: it achieves the second-best performance in detecting toxic posts targeting identity groups and comparable results on the other metrics. These findings demonstrate that incorporating human rationales into attention mechanisms can enhance interpretability and faithfulness without compromising fairness.
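
To make the joint objective concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes a mean-squared-error alignment term between a single attention distribution and a normalized binary rationale mask, weighted by a hypothetical coefficient `alpha`. The abstract does not specify the exact form of the alignment loss, the attention layer or head being supervised, or the weighting, so all function names and these choices are illustrative assumptions.

```python
import math


def cross_entropy(logits, label):
    """Standard classification loss: negative log-softmax of the true class."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def rationale_target(mask):
    """Turn a binary rationale mask into a distribution over tokens.

    Returns None when no token is marked (assumed here to mean the
    example carries no rationale, e.g. a non-hateful post).
    """
    total = sum(mask)
    if total == 0:
        return None
    return [m / total for m in mask]


def alignment_loss(attention, mask):
    """Assumed alignment term: MSE between attention and rationale target."""
    target = rationale_target(mask)
    if target is None:
        return 0.0  # assumption: skip alignment when there is no rationale
    return sum((a - t) ** 2 for a, t in zip(attention, target)) / len(attention)


def joint_loss(logits, label, attention, mask, alpha=1.0):
    """Classification loss plus alpha-weighted attention alignment loss."""
    return cross_entropy(logits, label) + alpha * alignment_loss(attention, mask)


if __name__ == "__main__":
    logits = [0.2, 1.5]                  # hypothetical two-class scores
    attention = [0.25, 0.25, 0.25, 0.25]  # uniform attention over 4 tokens
    mask = [0, 1, 1, 0]                  # annotators marked tokens 1 and 2
    print(joint_loss(logits, label=1, attention=attention, mask=mask))
```

In this sketch, attention that already matches the normalized rationale mask contributes no alignment penalty, so the objective reduces to the classification loss; setting `alpha` to zero recovers an unsupervised-attention classifier.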
