Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency
Yiran Liu, Ke Yang, Zehan Qi, Xiao Liu, Yang Yu, ChengXiang Zhai
TL;DR
The paper introduces the Bias-Volatility Framework (BVF), a statistical approach to quantify both mean bias and context-driven generation volatility in large language models (LLMs). BVF defines a stereotype distribution across contexts, a discrimination risk criterion $J$, and a clear bias–volatility decomposition ($R^b$, $R^v$) to attribute discrimination to systematic bias or generation inconsistency. Applying BVF to 12 LLMs, the study finds bias risk to be the dominant contributor to discrimination, with widespread pro-male stereotypes across occupations; RLHF reduces overall discrimination risk but increases volatility, and discrimination risk correlates with socio-economic factors like salaries. The framework supports automatic context collection, enables cross-model comparisons, and is generalizable to other parametric biases and modalities, offering a principled tool for fairness auditing and risk management in real-world deployments.
Abstract
We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current alignment evaluation metrics often overlook the randomness in stereotypes that arises from LLMs' inconsistent generative behavior. For instance, an LLM may display contradictory stereotypes, such as those related to gender or race, for the same profession in different contexts. Ignoring this inconsistency risks misleading conclusions in alignment assessments and undermines efforts to evaluate the potential of LLMs to perpetuate or amplify social biases and unfairness. To address this, we propose the Bias-Volatility Framework (BVF), which estimates the probability distribution of stereotypes in LLM outputs. By capturing the variation in generative behavior, BVF assesses both the likelihood and the degree to which LLM outputs negatively impact vulnerable groups, enabling quantification of the aggregated discrimination risk. Additionally, we introduce a mathematical framework to decompose this risk into bias risk (from the mean of the stereotype distribution) and volatility risk (from its variation). Applying BVF to 12 widely used LLMs, we find: i) Bias risk is the dominant contributor to discrimination; ii) Most LLMs exhibit substantial pro-male stereotypes across nearly all professions; iii) Reinforcement learning from human feedback reduces bias but increases volatility; iv) Discrimination risk correlates with socio-economic factors, such as professional salaries. Finally, we highlight BVF's broader applicability for assessing how generation inconsistencies in LLMs impact behavior beyond stereotypes.
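To make the bias/volatility split concrete, the following is a minimal sketch and not the paper's implementation. It assumes each context yields a signed stereotype score per profession (positive read here as pro-male, a sign convention chosen only for illustration) and uses the mean absolute score as a stand-in for the discrimination criterion $J$. The function names `mean_abs_criterion` and `decompose_risk` are hypothetical. Under these assumptions, overall risk is the average of $J$ over contexts, bias risk is $J$ applied to the context-averaged stereotype, and volatility risk is the remainder.

```python
from statistics import fmean


def mean_abs_criterion(stereotype):
    """Stand-in for the criterion J (an assumption, not the paper's definition):
    the mean absolute stereotype score across professions."""
    return fmean([abs(score) for score in stereotype])


def decompose_risk(stereotypes, criterion=mean_abs_criterion):
    """Split overall risk into a bias part and a volatility part.

    stereotypes: list of rows, one row per context; each row holds one signed
    stereotype score per profession (hypothetical input format).
    Returns (overall_risk, bias_risk, volatility_risk).
    """
    # Overall risk: apply the criterion in every context, then average.
    overall = fmean([criterion(row) for row in stereotypes])
    # Bias risk: average the stereotype over contexts first, then apply the criterion.
    mean_stereotype = [fmean(column) for column in zip(*stereotypes)]
    bias = criterion(mean_stereotype)
    # Volatility risk: the part of the overall risk not explained by the mean.
    return overall, bias, overall - bias


# Two contexts, two professions; the second profession flips sign across contexts.
samples = [[0.4, -0.2], [0.0, 0.2]]
overall, bias, volatility = decompose_risk(samples)
print(overall, bias, volatility)  # approximately 0.2, 0.1, 0.1
```

In this toy input the second profession's scores cancel in the mean, so it contributes nothing to bias risk but still contributes to volatility risk. Because the stand-in criterion is convex, the volatility term is never negative; a different choice of criterion would not necessarily have this property.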
