Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information
Shadi Iskander, Kira Radinsky, Yonatan Belinkov
TL;DR
DAFair mitigates social bias in language models without demographic labels. It compares each input's representation to predefined prototypical demographic texts and adds a KL divergence regularizer during fine-tuning. The method defines multiple representations per social attribute, ensembles over pairs of these representations, and minimizes a total loss $L_{total} = L_{ce} + \lambda L_{kl}$, where $L_{kl}$ encourages uniform similarity to the demographic prototypes. It covers both a no-label setting and a limited-label setting (Semi-DAFair), and reduces bias on occupation prediction and Twitter sentiment tasks with BERT and DeBERTa-V3 at a modest cost in accuracy. The authors discuss extension to other bias types, ethical considerations, and limitations, including the reliance on predefined texts and the focus on binary gender.
Abstract
Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in language models. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias in the model's representations. Our empirical results across two tasks and two models demonstrate the effectiveness of our method compared to previous approaches that do not rely on labeled data. Moreover, with limited demographic-annotated data, our approach outperforms common debiasing approaches.
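The objective from the TL;DR, $L_{total} = L_{ce} + \lambda L_{kl}$, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes that $L_{ce}$ is the task cross-entropy, that similarity is a dot product between a text representation and each prototype representation, that similarities are turned into a distribution with a softmax, and that $L_{kl}$ is the KL divergence from that distribution to the uniform distribution. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kl_to_uniform(probs):
    # KL(probs || uniform) = sum_i p_i * log(p_i * K); zero iff probs is uniform.
    # Assumption: the direction of the KL term is chosen for illustration.
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)

def cross_entropy(logits, label):
    # Task loss for a single example with an integer class label.
    return -math.log(softmax(logits)[label])

def total_loss(task_logits, label, text_repr, prototypes, lam=1.0):
    # Similarity of the text representation to each demographic prototype
    # (assumed here to be a plain dot product).
    sims = [dot(text_repr, p) for p in prototypes]
    l_kl = kl_to_uniform(softmax(sims))
    l_ce = cross_entropy(task_logits, label)
    return l_ce + lam * l_kl

# Toy example: two prototypes, a representation that leans toward the first.
biased = total_loss([2.0, 0.5], 0, [1.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], lam=1.0)
neutral = total_loss([2.0, 0.5], 0, [1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]], lam=1.0)
```

In this toy setup the representation that is equally similar to both prototypes incurs no regularization penalty, so `neutral` equals the cross-entropy term alone, while `biased` is strictly larger. In practice the same quantity would be computed on batched encoder outputs with an automatic-differentiation library.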
