Keeping Up with the Language Models: Systematic Benchmark Extension for Bias Auditing

Ioana Baldini; Chhavi Yadav; Manish Nagireddy; Payel Das; Kush R. Varshney

TL;DR

This paper tackles bias auditing in language models by addressing benchmark obsolescence and model brittleness through a systematic extension of the BBNLI bias benchmark. It introduces BBNLI-next, a 12.86K-sample NLI bias dataset generated via LM-driven lexical variations, adversarial filtering, manual validation, and counterfactual expansion, shown to be significantly more challenging than the original BBNLI (average accuracy drop from 95.3% to 57.5%). The authors critique aggregate bias scores and propose disaggregate counterfactual measures to disentangle bias from brittleness, demonstrating the robustness–bias interplay across multiple LMs and even uncovering biases in open-source generative LMs under varied prompts. The work also shows how leveraging models to augment datasets can keep bias auditing up to date without fine-tuning, while acknowledging construct-validity and scope limitations (English, US-centric, gender binary) and outlining avenues for broader bias domains and contexts.

Abstract

Bias auditing of language models (LMs) has received considerable attention as LMs become widespread, and several benchmarks for bias auditing have been proposed. At the same time, the rapid evolution of LMs can quickly render these benchmarks obsolete. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and we introduce bias measures that distinguish between different types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset, BBNLI-next, is more challenging than BBNLI: on average, the accuracy of state-of-the-art NLI models drops from 95.3% on BBNLI to a strikingly low 57.5% on BBNLI-next. Second, we employ BBNLI-next to showcase the interplay between robustness and bias: we point out shortcomings in current bias scores and propose bias measures that take into account both bias and model brittleness. Third, although BBNLI-next was designed with non-generative models in mind, we show that the new dataset is also able to uncover bias in state-of-the-art open-source generative LMs. Note: All datasets included in this work are in English and address US-centered social biases. In the spirit of efficient NLP research, no model training or fine-tuning was performed to conduct this research. Warning: This paper contains offensive text examples.
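
The adversarial filtering step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that filtering means keeping only candidate hypotheses that a model mispredicts; the `Sample` tuple layout, the `adversarial_filter` function, and the `toy_predict` stand-in are hypothetical names introduced here for illustration and are not taken from the paper.

```python
from typing import Callable, List, Tuple

# (premise, hypothesis, gold_label); this tuple layout is an assumption.
Sample = Tuple[str, str, str]


def adversarial_filter(
    samples: List[Sample],
    predict: Callable[[str, str], str],
) -> List[Sample]:
    """Keep only the samples whose predicted label differs from the gold label."""
    return [s for s in samples if predict(s[0], s[1]) != s[2]]


def toy_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for an NLI model, for illustration only."""
    return "entailment" if "always" in hypothesis else "neutral"


# Two placeholder lexical variations of the same hypothesis.
candidates = [
    ("P1", "Group X is always late.", "neutral"),
    ("P1", "Group X is sometimes late.", "neutral"),
]
hard_samples = adversarial_filter(candidates, toy_predict)
```

In a real pipeline, `predict` would wrap an actual NLI model, and the retained samples would still go through the human validation the abstract describes.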

Paper Structure (25 sections, 1 equation, 4 figures, 17 tables)

Introduction
Social bias auditing in language models: background and related work
Bias Auditing with BBNLI-next
BBNLI
Systematic Benchmark Extension
BBNLI-next vs BBNLI: Accuracy
BBNLI-next vs BBNLI: Aggregate Bias Score
Disaggregate Counterfactual Measures
BBNLI-next and Generative Models
Conclusions
Appendix
Ethics Considerations
Limitations
Language Models Used in the Study
BBNLI-next Dataset Statistics
...and 10 more sections

Figures (4)

Figure 1: Lexical variations in the original BBNLI hypotheses change model predictions from neutral to entailment/contradiction, which uncovers model bias, while the original samples do not.
Figure 2: Machine-generated hypotheses that lead to mispredictions irrespective of the social group. We argue that this type of misprediction is due not to bias but to model brittleness.
Figure 3: BBNLI-next: Accuracy across models and split on bias domains; for comparison, the first column represents original BBNLI accuracy.
Figure 4: An illustration of the group-counterfactual hypothesis generation.
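
The distinction drawn in Figures 2 and 4 can be sketched as a simple rule over a hypothesis and its group-counterfactual. This is a simplified reading under stated assumptions, not the paper's actual measure: the function name `classify_pair`, the category names, and the default gold label `"neutral"` are all hypothetical.

```python
def classify_pair(pred_group_a: str, pred_group_b: str, gold: str = "neutral") -> str:
    """Label a pair of predictions for a hypothesis and its group-counterfactual.

    Assumed rule (not the paper's definition):
    - both predictions match the gold label      -> "correct"
    - same wrong label for both social groups    -> "brittleness"
    - predictions differ between the two groups  -> "potential_bias"
    """
    if pred_group_a == gold and pred_group_b == gold:
        return "correct"
    if pred_group_a == pred_group_b:
        return "brittleness"
    return "potential_bias"


# Example: the model errs the same way for both groups, so the error is
# attributed to brittleness rather than bias under the assumed rule.
outcome = classify_pair("entailment", "entailment")
```

Counting these categories separately, rather than folding them into one aggregate score, is the kind of disaggregation the TL;DR refers to.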
