Bias after Prompting: Persistent Discrimination in Large Language Models
Nivedha Sivakumar, Natalie Mackraz, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
TL;DR
The paper investigates whether intrinsic biases in pre-trained causal LLMs transfer to their prompt-adapted downstream behavior, introducing unified Selection Bias metrics to compare intrinsic and extrinsic biases across zero-shot, few-shot, and Chain-of-Thought prompting. It demonstrates strong bias transfer across demographics and tasks (e.g., gender in co-reference resolution with ρ ≥ 0.94; age and religion in QA with ρ ≥ 0.98 and ρ ≥ 0.69, respectively) and shows that this transfer persists under varied few-shot compositions. A broad set of prompt-based debiasing strategies fails to consistently prevent transfer across models and demographics, implying that mitigating intrinsic biases before prompting is crucial, though improvements in reasoning may aid debiasing. The findings highlight the practical importance of fair pre-trained models and suggest directions such as attention-level interventions as potential partial mitigations, with significant implications for real-world deployment of prompt-based LLM systems.
Abstract
A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks: for example, gender (ρ ≥ 0.94) in co-reference resolution, and age (ρ ≥ 0.98) and religion (ρ ≥ 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution, and representational balance (ρ ≥ 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduces bias transfer across models, tasks, or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
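To make the reported ρ values concrete, the sketch below shows one way a correlation between intrinsic and prompt-adapted bias scores could be computed. It rests on assumptions: the abstract does not say which correlation coefficient is used, so Pearson correlation is taken here for illustration, and the names `pearson`, `intrinsic_bias`, and `prompted_bias`, along with all numeric values, are hypothetical placeholders rather than results or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical bias scores, one entry per group (e.g. per occupation).
# These are illustrative placeholders, NOT values reported in the paper.
intrinsic_bias = [0.10, -0.25, 0.40, 0.05, -0.30]  # pre-trained model
prompted_bias = [0.12, -0.20, 0.35, 0.08, -0.28]   # same model after prompting

rho = pearson(intrinsic_bias, prompted_bias)
print(f"rho = {rho:.3f}")
```

Under this reading, a ρ close to 1 would mean that groups the pre-trained model is most biased toward remain the most biased after prompting, which is the pattern the abstract describes as bias transfer.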
