Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?

Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

TL;DR

This paper tackles bias and fairness in hate speech detection by conducting the first large-scale, quantitative study of how input-based explanations relate to fairness. It systematically evaluates both encoder- and decoder-based models across two datasets, organized around three research questions: can explanations detect biased predictions, can they guide automatic model selection, and can they supervise bias mitigation during training? The findings show that input-based explanations effectively identify biased predictions and can supervise debiasing with favorable fairness–accuracy trade-offs, but they are not reliable for selecting the fairest models. Moreover, explanations remain useful for bias detection even after debiasing and typically outperform LLM-based judgments in bias assessment. The work provides practical guidance on which explanation methods best support bias detection and mitigation, and it releases code for reproducibility and further study.

Abstract

Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates. Our code is available at https://github.com/Ewanwong/fairness_x_explainability.

Paper Structure

This paper contains 53 sections, 4 equations, 32 figures, 11 tables.

Figures (32)

  • Figure 1: Workflow diagram illustrating the processes used to address each research question. Sensitive tokens are shown in blue boxes, and the intensity of the green shading reflects each word’s contribution to the model’s prediction.
  • Figure 2: Fairness correlation results for each explanation method. Occlusion- and L2-based explanations are effective for bias detection across different bias types and models.
  • Figure 3: Rank correlations between validation set average absolute sensitive token reliance and test set individual unfairness. The validation set sizes are 500 for race and gender, and 200 for religion. None of the explanation methods consistently achieve performance on par with the baseline.
  • Figure 4: Average MRR@1 across bias types. Explanation methods perform worse than the baseline in identifying the fairest models.
  • Figure 5: Accuracy and fairness results for bias mitigation using different explanation methods. Each column corresponds to models selected by maximizing the fairness-balanced metric with respect to the indicated bias metric. We find that explanation methods can improve fairness across many metrics while maintaining reasonable task accuracy.
  • ...and 27 more figures
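
The model-selection check described in the Figure 3 caption (correlating each candidate model's average absolute sensitive token reliance on a validation set with its unfairness on a test set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the use of a binary mask to mark sensitive tokens, the per-example averaging, and the choice of Spearman's coefficient as the rank correlation are all hypothetical.

```python
from statistics import mean


def avg_abs_sensitive_reliance(attributions, sensitive_masks):
    """Mean absolute attribution on sensitive tokens, averaged over examples.

    attributions: list of per-token attribution scores, one list per example.
    sensitive_masks: matching lists of 0/1 flags marking sensitive tokens
    (hypothetical input format, assumed for this sketch).
    """
    per_example = []
    for scores, mask in zip(attributions, sensitive_masks):
        sensitive = [abs(s) for s, m in zip(scores, mask) if m]
        if sensitive:  # skip examples without any sensitive token
            per_example.append(mean(sensitive))
    return mean(per_example) if per_example else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    denom = (var_x * var_y) ** 0.5
    return cov / denom if denom else float("nan")
```

In use, one would compute `avg_abs_sensitive_reliance` once per candidate model on the validation set, collect the corresponding test-set unfairness scores, and pass both lists to `spearman`. A coefficient near 1 would mean that reliance on sensitive tokens ranks models in the same order as their measured unfairness; the captions for Figures 3 and 4 report that no explanation method matches the baseline consistently on this task.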