
Inference-Time Selective Debiasing to Enhance Fairness in Text Classification Models

Gleb Kuzmin, Neemesh Yadav, Ivan Smirnov, Timothy Baldwin, Artem Shelmanov

TL;DR

This work tackles the challenge of improving fairness in text classification when full model retraining is not feasible. It introduces selective debiasing, an inference-time mechanism that applies post-processing debiasing only to predictions deemed biased by a KL-divergence–based bias score, thus balancing performance and fairness. By combining LEACE post-processing with a KL-based selection, the approach achieves competitive fairness–performance trade-offs and narrows the gap between post-processing and retraining-based debiasing. The method is computationally lightweight and applicable to encoder-based models on datasets with explicit protected attributes, offering a practical path to safer NLP systems in constrained settings.

Abstract

We propose selective debiasing, an inference-time safety mechanism designed to enhance the overall model quality in terms of prediction performance and fairness, especially in scenarios where retraining the model is impractical. The method draws inspiration from selective classification, where at inference time, predictions with low quality, as indicated by their uncertainty scores, are discarded. In our approach, we identify the potentially biased model predictions and, instead of discarding them, we remove bias from these predictions using LEACE, a post-processing debiasing method. To select problematic predictions, we propose a bias quantification approach based on KL divergence, which achieves better results than standard uncertainty quantification methods. Experiments on text classification datasets with encoder-based classification models demonstrate that selective debiasing helps to reduce the performance gap between post-processing methods and debiasing techniques from the at-training and pre-processing categories.
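
The selection step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the text above does not confirm: the bias score is taken to be the KL divergence between the original model's predicted class distribution and the distribution produced after debiasing, the debiased probabilities are supplied as plain inputs (LEACE itself is not implemented here), and the predictions to replace are chosen as a fixed top fraction by score. The names `kl_divergence`, `selective_debias`, and `fraction` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def selective_debias(original_probs, debiased_probs, fraction):
    """Replace the highest-scoring share of predictions with their debiased version.

    Assumption: the bias score of an instance is KL(original || debiased).
    `fraction` is the share of instances to replace (hypothetical parameter).
    """
    scores = [kl_divergence(p, q) for p, q in zip(original_probs, debiased_probs)]
    n_replace = int(fraction * len(scores))
    # Rank instances from the highest bias score to the lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:n_replace])
    final = [
        debiased_probs[i] if i in selected else original_probs[i]
        for i in range(len(scores))
    ]
    return final, scores, selected


# Toy inputs: three instances, two classes. The debiased probabilities are
# made-up values standing in for the output of a post-processing debiaser.
original = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
debiased = [[0.88, 0.12], [0.3, 0.7], [0.25, 0.75]]

final, scores, selected = selective_debias(original, debiased, fraction=0.34)
labels = [max(range(len(p)), key=lambda c: p[c]) for p in final]
```

In this toy run only the second instance, whose prediction shifts the most under debiasing, is replaced; the other two keep the original model's output. A threshold on the score could be used instead of a fixed fraction, which is an equally hypothetical design choice.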

Paper Structure

This paper contains 24 sections, 8 equations, 1 figure, 20 tables.

Figures (1)

  • Figure 1: Rejection results for fairness and accuracy with oracle scores on a synthetic dataset with a LogReg model; FR-AUC and Acc-AUC are the areas under the fairness-rejection and accuracy-rejection curves, respectively. Details are presented in Section \ref{sec:fairness_oracle}.