Keeping Up with the Language Models: Systematic Benchmark Extension for Bias Auditing
Ioana Baldini, Chhavi Yadav, Manish Nagireddy, Payel Das, Kush R. Varshney
TL;DR
This paper addresses two obstacles to bias auditing of language models (LMs): benchmark obsolescence and model brittleness. It systematically extends the BBNLI bias benchmark into BBNLI-next, a 12.86K-sample NLI bias dataset built from LM-generated lexical variations, adversarial filtering, manual validation, and counterfactual expansion. BBNLI-next is substantially more challenging than the original BBNLI: the average accuracy of state-of-the-art NLI models drops from 95.3% to 57.5%. The authors critique aggregate bias scores and propose disaggregated counterfactual measures that separate bias from brittleness. They use these measures to show the interplay between robustness and bias across multiple LMs, and they also uncover biases in open-source generative LMs under varied prompts. The work shows that enlisting models to augment datasets can keep bias auditing up to date without any model training or fine-tuning. It acknowledges limitations in construct validity and scope (English only, US-centric, binary gender) and outlines extensions to broader bias domains and contexts.
Abstract
Bias auditing of language models (LMs) has received considerable attention as LMs become widespread. As such, several benchmarks for bias auditing have been proposed. At the same time, the rapid evolution of LMs can quickly make these benchmarks irrelevant. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between different types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset, BBNLI-next, is more challenging than BBNLI: on average, the accuracy of state-of-the-art NLI models falls from 95.3%, as measured on BBNLI, to a strikingly low 57.5% on BBNLI-next. Second, we employ BBNLI-next to showcase the interplay between robustness and bias: we point out shortcomings in current bias scores and propose bias measures that take into account both bias and model brittleness. Third, although BBNLI-next was designed with non-generative models in mind, we show that the new dataset is also able to uncover bias in state-of-the-art open-source generative LMs. Note: All datasets included in this work are in English and they address US-centered social biases. In the spirit of efficient NLP research, no model training or fine-tuning was performed to conduct this research. Warning: This paper contains offensive text examples.
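To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that adversarial filtering means keeping candidate samples that a model labels incorrectly, and that errors can be disaggregated by comparing a model's predictions on counterfactual pairs (two samples that differ only in the group mentioned). The sample format, the function names `adversarial_filter` and `tally_counterfactual_pairs`, and the three outcome categories are all hypothetical and chosen for illustration; the paper's actual measures may be defined differently.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical sample format (an assumption): (premise, hypothesis, gold_label)
Sample = Tuple[str, str, str]
# Any function mapping (premise, hypothesis) to a predicted NLI label
Predictor = Callable[[str, str], str]


def adversarial_filter(candidates: Iterable[Sample], predict: Predictor) -> List[Sample]:
    """Keep only the candidates that the model labels incorrectly."""
    return [s for s in candidates if predict(s[0], s[1]) != s[2]]


def tally_counterfactual_pairs(
    pairs: Iterable[Tuple[Sample, Sample]], predict: Predictor
) -> Counter:
    """Count outcomes over counterfactual pairs.

    both_correct: the model is right on both versions.
    both_wrong:   the model errs regardless of group (suggests brittleness).
    one_wrong:    the model errs on only one version (suggests bias).
    This interpretation is an illustrative assumption, not the paper's definition.
    """
    counts: Counter = Counter()
    for a, b in pairs:
        ok_a = predict(a[0], a[1]) == a[2]
        ok_b = predict(b[0], b[1]) == b[2]
        if ok_a and ok_b:
            counts["both_correct"] += 1
        elif not ok_a and not ok_b:
            counts["both_wrong"] += 1
        else:
            counts["one_wrong"] += 1
    return counts
```

In practice `predict` would wrap an NLI model, and the filtered candidates would still go through human validation before entering the dataset. Reporting the three counts separately, rather than folding them into one aggregate score, is what lets an error pattern that depends on the group be told apart from one that does not.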
