Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
TL;DR
This paper presents the first large-scale, quantitative study of how input-based explanations relate to fairness in hate speech detection. It systematically evaluates both encoder- and decoder-based models on two datasets and is organized around three research questions: can explanations detect biased predictions, can they guide automatic model selection, and can they supervise bias mitigation during training? The findings show that input-based explanations effectively identify biased predictions and can supervise debiasing with favorable fairness–accuracy trade-offs, but that they are not reliable for selecting the fairest models. Moreover, explanations remain useful for bias detection even after debiasing and typically outperform LLM-based judgments in bias assessment. The work provides practical guidance on which explanation methods best support bias detection and mitigation, and it releases code for reproducibility and further study.
Abstract
Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates. Our code is available at https://github.com/Ewanwong/fairness_x_explainability.
