
Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency

Yiran Liu, Ke Yang, Zehan Qi, Xiao Liu, Yang Yu, ChengXiang Zhai

TL;DR

The paper introduces the Bias-Volatility Framework (BVF), a statistical approach to quantify both mean bias and context-driven generation volatility in large language models (LLMs). BVF defines a stereotype distribution across contexts, a discrimination risk criterion $J$, and a clear bias–volatility decomposition ($R^b$, $R^v$) to attribute discrimination to systematic bias or generation inconsistency. Applying BVF to 12 LLMs, the study finds bias risk to be the dominant contributor to discrimination, with widespread pro-male stereotypes across occupations; RLHF reduces overall discrimination risk but increases volatility, and discrimination risk correlates with socio-economic factors like salaries. The framework supports automatic context collection, enables cross-model comparisons, and is generalizable to other parametric biases and modalities, offering a principled tool for fairness auditing and risk management in real-world deployments.

Abstract

We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current alignment evaluation metrics often overlook stereotypes' randomness caused by LLMs' inconsistent generative behavior. For instance, LLMs may display contradictory stereotypes, such as those related to gender or race, for identical professions in different contexts. Ignoring this inconsistency risks misleading conclusions in alignment assessments and undermines efforts to evaluate the potential of LLMs to perpetuate or amplify social biases and unfairness. To address this, we propose the Bias-Volatility Framework (BVF), which estimates the probability distribution of stereotypes in LLM outputs. By capturing the variation in generative behavior, BVF assesses both the likelihood and degree to which LLM outputs negatively impact vulnerable groups, enabling a quantification of aggregated discrimination risk. Additionally, we introduce a mathematical framework to decompose this risk into bias risk (from the mean of the stereotype distribution) and volatility risk (from its variation). Applying BVF to 12 widely used LLMs, we find: i) Bias risk is the dominant contributor to discrimination; ii) Most LLMs exhibit substantial pro-male stereotypes across nearly all professions; iii) Reinforcement learning from human feedback reduces bias but increases volatility; iv) Discrimination risk correlates with socio-economic factors, such as professional salaries. Finally, we highlight BVF's broader applicability for assessing how generation inconsistencies in LLMs impact behavior beyond stereotypes.
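The decomposition described in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation: the criterion `criterion` below (squared deviation from an assumed unbiased probability `p_star`) is a hypothetical stand-in for $J$, and the split used here (bias risk as the criterion evaluated at the context-averaged probability, volatility risk as the remainder) is an assumption chosen to match the abstract's description of "mean" and "variation". All function and parameter names are illustrative.

```python
def criterion(p_male, p_star=0.5):
    """Hypothetical stand-in for J: squared deviation from the
    assumed unbiased probability p_star (convex in p_male)."""
    return (p_male - p_star) ** 2


def decompose(probs, weights=None, p_star=0.5):
    """Split the risk for one fixed occupation x into bias and volatility.

    probs   -- P(male | context c, occupation x), one value per context
    weights -- context probabilities (uniform if None); renormalized here
    Returns (overall, bias, volatility).
    """
    n = len(probs)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    weights = [w / total for w in weights]

    # Overall risk: expected criterion value over the context distribution.
    overall = sum(w * criterion(p, p_star) for w, p in zip(weights, probs))
    # Bias risk (assumed form): criterion at the context-averaged probability.
    mean_p = sum(w * p for w, p in zip(weights, probs))
    bias = criterion(mean_p, p_star)
    # Volatility risk: remainder; non-negative for a convex criterion (Jensen).
    volatility = overall - bias
    return overall, bias, volatility


def aggregate(risks, weights=None):
    """Weighted mean of per-occupation risks over the distribution of X."""
    if weights is None:
        weights = [1.0] * len(risks)
    return sum(w * r for w, r in zip(weights, risks)) / sum(weights)


if __name__ == "__main__":
    # Two contexts for the same occupation, both leaning male.
    overall, bias, volatility = decompose([0.9, 0.7])
    print(overall, bias, volatility)
```

With the squared criterion assumed here, the volatility term reduces to the variance of the predicted probability across contexts, so a model that answers identically in every context has zero volatility risk and all of its risk is attributed to bias.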

Paper Structure (46 sections, 19 equations, 17 figures, 9 tables)


Figures (17)

  • Figure 1: Snow metaphorically represents any human or language model. Snow's biases are manifested in the statistical properties of its perspectives (i.e., $\mathbf{p}_{Snow}(Y|X)$) on a topic (i.e., $Y$) conditioned on evidence (i.e., $X$). These include persistent bias and context-dependent volatility, which correlate respectively with the mean and variation of a bias-measuring random variable derived from the perspectives.
  • Figure 1: The discrimination risk of various LLMs concerning gender, given occupations as evidence, with the worst performance in bold and the best performance in underlined italics.
  • Figure 2: Our statistical framework for measuring stereotypes in large language models (LLMs). As a case study, we investigate the biases of an LLM regarding $Y=\{Binary\ Gender\}$, with $X=\{Occupations\}$ as the context evidence. Starting with the LLM’s predicted word probability matrix for $Y$ (blue for male and pink for female) conditioned on contexts $C$ augmented with $X$, we apply the discrimination criterion $J$ on each element to transform the word probability matrix into a discrimination risk matrix. We then aggregate the discrimination risk matrix across $C$’s distribution and derive a discrimination risk vector, capturing the risk for each fixed $X=x$. Finally, by aggregating the discrimination risk vector over $X$’s distribution, we obtain the LLM's overall discrimination risk concerning $Y$.
  • Figure 3: Given the unbiased predicted probability $\mathbf{p}^{\star}$, the relationship of probability $\mathbf{p}$ (middle) to stereotype $s$ (left) and discrimination risk $r$ (right). The right panel also illustrates the risk decomposition.
  • Figure 4: Our approach to data mining contexts involves i) extracting sentences containing terms from $X$ and $Y$ with coreference, ii) parsing and recording their structure, and iii) tallying their skeletons to estimate the distribution of $C$.
  • ...and 12 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3