Template-Based Probes Are Imperfect Lenses for Counterfactual Bias Evaluation in LLMs

Farnaz Kohankhaki, D. B. Emerson, Jacob-Junqi Tian, Laleh Seyyed-Kalantari, Faiza Khan Khattak

TL;DR

Template-based probes can introduce systematic distortions in bias measurements. The findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, so that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.

Abstract

Bias in large language models (LLMs) takes many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It aims to measure whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can introduce systematic distortions in bias measurements. Specifically, such probes suggest that LLMs classify text associated with the White race as negative at disproportionately elevated rates. This is observed consistently across a large collection of LLMs, over several diverse template-based probes, and with different classification approaches. We hypothesize that this arises artificially from a mismatch between markedness in LLM pretraining data (e.g., Black president vs. president) and the templates used for bias measurement (e.g., Black president vs. White president). These findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, ensuring that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.
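
As a rough illustration of the kind of measurement described above, the sketch below fills a few templates with different group terms and compares how often a classifier wrongly predicts negative sentiment for each group. Everything in it is an assumption made for illustration: the templates, the group list, the keyword-based `toy_classifier` standing in for an LLM, and the choice to define the gap as each group's false positive rate minus the mean across groups. It is not the paper's datasets, models, or exact metric.

```python
from collections import defaultdict

# Hypothetical templates: "{group}" is replaced by each group term.
TEMPLATES = [
    ("The {group} customer said the product was great.", "positive"),
    ("The {group} customer said the product arrived on time.", "neutral"),
    ("The {group} customer said the product was terrible.", "negative"),
]
GROUPS = ["Black", "White", "Asian"]  # illustrative group terms


def toy_classifier(text):
    """Stand-in for an LLM sentiment classifier (illustrative only)."""
    lowered = text.lower()
    if "terrible" in lowered:
        return "negative"
    if "great" in lowered:
        return "positive"
    return "neutral"


def negative_fpr_by_group(classify, templates, groups):
    """Share of non-negative probes predicted negative, per group."""
    false_pos = defaultdict(int)
    total = defaultdict(int)
    for template, label in templates:
        if label == "negative":
            continue  # only non-negative probes can be false positives
        for group in groups:
            total[group] += 1
            if classify(template.format(group=group)) == "negative":
                false_pos[group] += 1
    return {g: false_pos[g] / total[g] for g in groups}


def fpr_gaps(fpr):
    """Assumed gap definition: group FPR minus the mean FPR across groups."""
    mean = sum(fpr.values()) / len(fpr)
    return {g: rate - mean for g, rate in fpr.items()}


if __name__ == "__main__":
    rates = negative_fpr_by_group(toy_classifier, TEMPLATES, GROUPS)
    print(fpr_gaps(rates))
```

Because `toy_classifier` ignores the group term, every gap here is zero. A classifier whose output shifts when only the group term changes would produce nonzero gaps, which is the disparity that counterfactual evaluation is meant to surface.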

Paper Structure (26 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Negative- and Positive-Sentiment FPR gaps as measured by the Amazon dataset.
  • Figure 2: Negative- and Positive-Sentiment FPR gaps as measured by the NS-Prompts dataset.
  • Figure 3: Negative- and Positive-Sentiment FPR gaps as measured by the Regard dataset.
  • Figure 4: Negative- and Positive-Sentiment FPR gaps as measured by the Amazon dataset with prompt-based classification. In the legend, model names without a suffix indicate zero-shot prompting. SST5 and SE indicate 9-shot prompts with examples drawn from the SST5 and SemEval datasets, respectively.
  • Figure 5: Negative- and Positive-Sentiment FPR gaps for protected group variations within the sensitive attribute of Sexuality as measured by the Amazon dataset. In the legend, model names without a suffix indicate zero-shot prompting. SST5 and SE indicate 9-shot prompts with examples drawn from the SST5 and SemEval datasets, respectively.
  • ...and 1 more figure