Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models

Natalie Mackraz, Nivedha Sivakumar, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

TL;DR

This work investigates whether biases present in pre-trained causal language models transfer to downstream tasks when the models are adapted via prompting. It introduces Selection Bias as a unified metric and analyzes correlations between intrinsic biases and those elicited by zero-shot and few-shot prompts across multiple models on the WinoBias pronoun-resolution task. The key finding is a strong, robust transfer: biases in the pre-trained model persist under prompting, and neither explicitly prompting the model to be fair or biased nor varying the length and stereotypical composition of the few-shot examples eliminates the transfer. The study highlights the importance of ensuring pre-trained model fairness for safe deployment and motivates extending evaluations to other adaptation strategies beyond prompting.

Abstract

Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of the fine-tuning adaptation strategy on model fairness, finding that fairness in pre-trained masked language models has limited effect on the fairness of models adapted using fine-tuning. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible and compute-efficient way to deploy models in real-world systems. In contrast to previous works, we establish that intrinsic biases in pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >= 0.94) with biases when the same models are zero- and few-shot prompted, using a pronoun co-reference resolution task. Further, we find that bias transfer remains strongly correlated even when LLMs are specifically prompted to exhibit fair or biased behavior (rho >= 0.92), and when few-shot length and stereotypical composition are varied (rho >= 0.97). Our findings highlight the importance of ensuring fairness in pre-trained LLMs, especially when they are later used to perform downstream tasks via prompt adaptation.
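
The correlation analysis described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: this page does not give the formula for Selection Bias, so `signed_bias` is a stand-in (a normalized difference that is zero when fair, negative when female-biased and positive when male-biased); the page does not say which correlation coefficient rho denotes, so Pearson correlation is used; and the occupation counts are made up, not results from the paper.

```python
from math import sqrt


def signed_bias(male_picks: int, female_picks: int) -> float:
    """Hypothetical stand-in for a signed bias score (NOT the paper's formula).

    Returns 0.0 when both pronouns are chosen equally often, a negative value
    when the female pronoun is chosen more often, and a positive value when
    the male pronoun is chosen more often.
    """
    total = male_picks + female_picks
    if total == 0:
        return 0.0
    return (male_picks - female_picks) / total


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up (male_picks, female_picks) counts per occupation, for illustration only.
intrinsic_counts = {
    "nurse": (20, 80),
    "carpenter": (90, 10),
    "teacher": (40, 60),
    "driver": (75, 25),
}
prompted_counts = {
    "nurse": (30, 70),
    "carpenter": (85, 15),
    "teacher": (45, 55),
    "driver": (70, 30),
}

# Align both settings on the same occupation order before correlating.
occupations = sorted(intrinsic_counts)
intrinsic = [signed_bias(*intrinsic_counts[o]) for o in occupations]
prompted = [signed_bias(*prompted_counts[o]) for o in occupations]

rho = pearson(intrinsic, prompted)
print(f"correlation between intrinsic and prompted bias: {rho:.3f}")
```

A correlation near 1 in such a comparison would mean that occupations with a strong bias in the pre-trained model keep a bias of similar direction and relative size after prompting, which is the sense of "bias transfer" used above.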

Paper Structure

This paper contains 16 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Text formatting on a hand-crafted sample (top left) for intrinsic generation (middle left), zero-shot prompting (bottom left) and few-shot prompting (right).
  • Figure 2: Bias (SB) of Llama 3 8B presented by adaptation and task type. These figures are best viewed in color.
  • Figure 3: Correlation of selection biases in occupations between: intrinsic and zero-shot adaptations (top) and intrinsic and few-shot adaptations (bottom). All results are strongly correlated with $\rho \geq 0.94$ and $p \approx 0$. Best viewed in color.
  • Figure 4: Selection bias (SB) for Llama 3 8B when varying the number and stereotype (anti- or pro-stereotypical) of few-shot samples. Ambiguous sentences always result in worse biases than non-ambiguous sentences, and increasing the number of anti-stereotypical samples incrementally worsens SB. This figure is best viewed in color.
  • Figure 5: Selection bias by occupation and WinoBias task type in Llama 3 8B when intrinsically, zero- and few-shot adapted. Fair is ideally zero; less than zero is female-biased and greater than zero is male-biased. Results are aggregated over 5 random seeds; standard deviation is overlaid on each bar in black.
  • ...and 5 more figures