Why Don't Prompt-Based Fairness Metrics Correlate?

Abdelrahman Zayed; Goncalo Mordido; Ioana Baldini; Sarath Chandar

TL;DR

This work investigates the reliability of prompt-based fairness metrics for evaluating social biases in large language models, focusing on gender and religion biases. It shows that leading prompt-based metrics (BOLD, HolisticBias, HONEST) exhibit poor cross-metric agreement. The authors introduce CAIRO, a data-augmentation framework that paraphrases prompts with multiple large-language-model paraphrasers, constructs prompt combinations, and selects those that maximize Pearson correlation across metrics. Empirical results across ten models and three metrics demonstrate substantial gains in cross-metric correlation (e.g., from $r=0.30$ to $r=0.90$ for gender and from $r=0.18$ to $r=0.98$ for religion) with significant $p$-values, suggesting a more reliable bias assessment pipeline; the code is publicly available.

Abstract

The widespread use of large language models has raised essential questions about the potential biases these models might learn. This led to the development of several metrics aimed at evaluating and mitigating these biases. In this paper, we first demonstrate that prompt-based fairness metrics exhibit poor agreement, as measured by correlation, raising important questions about the reliability of fairness assessment using prompts. Then, we outline six relevant reasons why such a low correlation is observed across existing metrics. Based on these insights, we propose a method called Correlated Fairness Output (CAIRO) to enhance the correlation between fairness metrics. CAIRO augments the original prompts of a given fairness metric by using several pre-trained language models and then selects the combination of the augmented prompts that achieves the highest correlation across metrics. We show a significant improvement in Pearson correlation from 0.30 and 0.18 to 0.90 and 0.98 across metrics for gender and religion biases, respectively. Our code is available at https://github.com/chandar-lab/CAIRO.
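
As a rough illustration of the selection step the abstract describes, the sketch below searches over combinations of prompt sources for two metrics and keeps the pair with the highest Pearson correlation. It is not the authors' implementation (see the linked repository for that). The function names, the toy scores, the averaging of per-model bias across prompt sources, and the exhaustive search are all assumptions made for illustration.

```python
from itertools import combinations, product
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def combined_scores(scores, combo):
    """Average the per-model bias over the prompt sources in `combo`.

    Assumption: averaging is just one plausible way to merge sources.
    """
    n_models = len(scores[combo[0]])
    return [sum(scores[s][m] for s in combo) / len(combo) for m in range(n_models)]


def non_empty_subsets(keys):
    """All non-empty combinations of prompt sources, in a fixed order."""
    keys = sorted(keys)
    return [c for k in range(1, len(keys) + 1) for c in combinations(keys, k)]


def best_combinations(metric_a, metric_b):
    """Return (r, combo_a, combo_b) with the highest cross-metric correlation.

    Each argument maps a prompt source (e.g. the original prompts or one
    paraphraser) to a list of bias scores, one per evaluated model.
    """
    best = None
    for ca, cb in product(non_empty_subsets(metric_a), non_empty_subsets(metric_b)):
        r = pearson(combined_scores(metric_a, ca), combined_scores(metric_b, cb))
        if best is None or r > best[0]:
            best = (r, ca, cb)
    return best


# Hypothetical toy scores for four models; these are NOT results from the paper.
toy_metric_a = {"original": [0.1, 0.5, 0.3, 0.9], "paraphraser_1": [0.2, 0.4, 0.6, 0.8]}
toy_metric_b = {"original": [0.9, 0.2, 0.5, 0.1], "paraphraser_1": [0.4, 0.8, 1.2, 1.6]}

r, combo_a, combo_b = best_combinations(toy_metric_a, toy_metric_b)
print(f"best r = {r:.2f} using {combo_a} and {combo_b}")
```

The exhaustive search grows exponentially with the number of prompt sources, so a sketch like this is only practical for a handful of paraphrasers.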

Paper Structure (39 sections, 1 equation, 12 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Embedding-based fairness metrics
Probability-based fairness metrics
Prompt-based fairness metrics
Background
Bias Quantification
Paraphrasing
Correlation between prompt-based fairness metrics
Why should prompt-based fairness metrics correlate?
Why don’t prompt-based fairness metrics correlate?
Prompt sentence structure
Prompt verbalization
Prompt distribution
Bias quantification in each metric
...and 24 more sections

Figures (12)

Figure 1: Correlated fairness between fairness metrics on gender and religion bias with and without CAIRO.
Figure 2: Changing the sentence structure and verbalization of the original prompts of BOLD using paraphrases from Mistral leads to significant changes in religion bias.
Figure 3: Changing the prompt-augmentation model to generate the paraphrases influences religion bias, as measured by BOLD.
Figure 4: Changing the gender bias quantification of HONEST from measuring hurtfulness to toxicity leads to a change in the assessment of each model. The bias values are normalized.
Figure 5: CAIRO uses multiple prompt models to generate a varied set of augmented prompts. Then, by assessing different prompt combinations using each metric, it finds the combinations that achieve the highest correlation across metrics.
...and 7 more figures
