Bias after Prompting: Persistent Discrimination in Large Language Models

Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

TL;DR

The paper investigates whether intrinsic biases in pre-trained causal LLMs transfer to downstream prompt-adapted models, introducing unified Selection Bias metrics to compare intrinsic and extrinsic biases across zero-shot, few-shot, and Chain-of-Thought prompting. It demonstrates strong bias transfer across demographics and tasks (e.g., gender in co-reference with ρ ≥ 0.94; age and religion in QA with ρ ≥ 0.98 and ρ ≥ 0.69, respectively) and shows that this transfer persists under varied few-shot compositions. A broad set of prompt-based debiasing strategies fails to consistently prevent transfer across models and demographics, implying that mitigating intrinsic biases before prompting is crucial, though improvements in reasoning can aid debiasing. The findings highlight the practical importance of fair pre-trained models and suggest directions such as attention-level interventions as potential partial mitigations, with significant implications for real-world deployment of prompt-based LLM systems.

Abstract

A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks -- for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduces bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
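
As a rough illustration of how bias transfer can be quantified, the sketch below correlates per-item bias scores measured intrinsically with scores measured after prompting. The scores are made-up toy values (not results from the paper), the helper name `pearson_r` is hypothetical, and the choice of Pearson correlation is an assumption, since the abstract reports rho without naming the estimator.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up bias scores for five items (assumption: one score per item,
# with the sign encoding the direction of the bias).
intrinsic_bias = [-0.8, -0.3, 0.0, 0.4, 0.9]
prompted_bias = [-0.7, -0.2, 0.1, 0.3, 0.8]

rho = pearson_r(intrinsic_bias, prompted_bias)
print(f"rho = {rho:.3f}")
```

A value of rho close to 1 on real measurements would indicate that items that are more biased intrinsically also tend to be more biased after prompting, which is the sense of "bias transfer" used above.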

Paper Structure

This paper contains 25 sections, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Correlation of occupation selection biases (O-SB) between intrinsic and prompt (zero- and few-shot) adaptations. Each point is the O-SB for a single occupation, model, and experimental random seed; for each model, correlation is computed across 40 occupations and 5 random seeds. All models exhibit strong bias transfer upon prompting, with $\rho \geq 0.94$ and $p \approx 0$.
  • Figure 2: Prompt formatting on a hand-crafted sample (top) for intrinsic generation (middle), and zero-shot prompting (bottom). Few-shot prompting contains 3 in-context samples unless otherwise specified (see App. \ref{sec:few-shot-neutral-prompts}), followed by a query prompt to the model. Prompting options are randomly sorted.
  • Figure 3: Bias (O-SB) in Llama 3 8B upon adaptation, aggregated over 5 random seeds. A bias of zero is fair; negative values indicate female bias, and positive values indicate male bias. Standard deviation is overlaid on each bar in black (intrinsic has no standard deviation, as greedy decoding has no stochasticity).
  • Figure 4: O-SB split by WinoBias ambiguity in Llama 3 8B when adapted with 100 anti-stereotypical prompts, with occupations sampled proportionally to Llama 3 8B's O-SB in Fig. \ref{fig:occ_results_by_type}. In contrast to Fig. \ref{fig:occ_results_by_type}, the Type 2 split often flips its bias orientation, and the Type 1 split produces a lower magnitude of bias.
  • Figure 5: Neutral three-shot prompt context containing one non-ambiguous sentence with a pro-stereotypical pronoun to the referent occupation, one non-ambiguous sentence with a pro-stereotypical pronoun to the referent occupation, and one ambiguous sentence with "Unknown" as the right answer. To assess fairness in the 3-shot setting, this context appears before each sentence of the WinoBias dataset, formatted as a multiple-choice question. Option ordering is random.
  • ...and 8 more figures
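
The prompt layout described in the captions for Figures 2 and 5 (three in-context samples followed by a query, each a multiple-choice question with randomly ordered options) can be sketched roughly as follows. The template wording, the function names `format_question` and `build_few_shot_prompt`, and the example sentences are all hypothetical; the paper's actual prompt text is not reproduced here.

```python
import random


def format_question(sentence, question, options, rng, answer=None):
    """Render one multiple-choice item, shuffling the option order with rng."""
    shuffled = list(options)
    rng.shuffle(shuffled)
    lines = [f"Sentence: {sentence}", f"Question: {question}"]
    lines += [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(shuffled)]
    # In-context samples carry their answer; the query leaves it open.
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(shots, query, seed=0):
    """Concatenate answered in-context samples, then the unanswered query."""
    rng = random.Random(seed)  # seeded so the option order is reproducible
    blocks = [
        format_question(s["sentence"], s["question"], s["options"], rng, s["answer"])
        for s in shots
    ]
    blocks.append(
        format_question(query["sentence"], query["question"], query["options"], rng)
    )
    return "\n\n".join(blocks)


# Hypothetical hand-written samples, not taken from WinoBias.
shots = [
    {"sentence": "The nurse thanked the mechanic because he fixed the car.",
     "question": "Who does 'he' refer to?",
     "options": ["nurse", "mechanic", "Unknown"], "answer": "mechanic"},
    {"sentence": "The chef called the driver because she needed a ride.",
     "question": "Who does 'she' refer to?",
     "options": ["chef", "driver", "Unknown"], "answer": "chef"},
    {"sentence": "The clerk met the farmer and she smiled.",
     "question": "Who does 'she' refer to?",
     "options": ["clerk", "farmer", "Unknown"], "answer": "Unknown"},
]
query = {"sentence": "The teacher helped the guard because he was late.",
         "question": "Who does 'he' refer to?",
         "options": ["teacher", "guard", "Unknown"]}

prompt = build_few_shot_prompt(shots, query, seed=0)
print(prompt)
```

Seeding the shuffle keeps each run reproducible while still randomizing option positions; repeating the evaluation with several seeds would correspond to the multiple random seeds mentioned in the captions.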