Prompt Fairness: Sub-group Disparities in LLMs

Meiyu Zhong, Noel Teku, Ravi Tandon

TL;DR

The paper examines prompt fairness in LLMs, revealing that semantically equivalent prompts phrased by different demographic-style groups yield divergent outputs. It introduces information-theoretic metrics, including $H(\hat{Y}|X,t,g)$ for subgroup sensitivity and $D_{g,g'}(t)$ for cross-group divergence, to quantify within-group variability and cross-group differences, respectively. A controlled evaluation pipeline combines subgroup-conditioned paraphrasing, prompt neutralization, and semantic embedding clustering to measure divergence, and it proposes two mitigation strategies: majority voting over prompt variants and demographic cue masking. Empirical results show substantial reductions in cross-group divergence after mitigation (from up to $0.28$ to around $0.17$–$0.22$), demonstrating improved fairness and robustness in outputs across demographic subgroups, with practical implications for equitable deployment of LLMs in high-stakes contexts.

Abstract

Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate the problem of prompt fairness: the phrasing of a prompt by different users/styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose information-theoretic metrics that capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup, and cross-group consistency, the variability of responses across subgroups. Our empirical analysis reveals that certain demographic subgroups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the 0.14 to 0.22 range. After applying our neutralization and multi-generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.
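The two quantities named in the abstract, within-subgroup variability and cross-subgroup divergence, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the formal definitions are not reproduced on this page, so the sketch assumes that each response has already been mapped to a discrete label (for example a cluster ID or a sentiment class), uses plain Shannon entropy for subgroup sensitivity and Jensen-Shannon divergence for cross-group consistency, and treats majority voting as a simple mode over repeated generations. All function names and the sample labels are hypothetical.

```python
import math
from collections import Counter


def distribution(labels, support):
    """Empirical distribution of response labels over a fixed, ordered support."""
    counts = Counter(labels)
    n = len(labels)
    return [counts[s] / n for s in support]


def entropy(p):
    """Shannon entropy in bits; higher means more variable responses in a subgroup."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def majority_vote(labels):
    """Most frequent label across several generations (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels for one question, rephrased in the style of two subgroups.
support = ["positive", "negative"]
group_a = ["positive", "positive", "negative", "positive"]
group_b = ["negative", "negative", "positive", "negative"]

p_a = distribution(group_a, support)
p_b = distribution(group_b, support)
print("sensitivity A:", entropy(p_a))
print("divergence A vs B:", js_divergence(p_a, p_b))
print("voted answer A:", majority_vote(group_a))
```

Aggregating several generations with `majority_vote` before comparing subgroups is one way to read the "multi-generation" mitigation: it collapses within-group noise so that any remaining divergence reflects systematic differences between groups.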

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Three variations of a review, each phrased in the style of a different subgroup (white male, black male, valley girl), are passed to an LLM for sentiment classification. The LLM responds differently depending on the tone of each variation, revealing the bias introduced when different groups rephrase the same prompt.
  • Figure 2: This figure illustrates our prompt neutralization pipeline. An input prompt is first rephrased multiple times using LLaMA-13B, and these variants are then processed by a Prompt Neutralizer to remove stylistic or demographic cues. The normalized prompts are first processed by OpenThinker-7B, and the resulting outputs are then fed into a separate semantic embedding model to obtain vector representations; these embeddings are subsequently clustered into $C_1, C_2, \dots, C_k$ to assess consistency across rephrased prompts.
  • Figure 3: (a) Subgroup consistency measured via symmetric KL divergence on the Adult dataset before bias mitigation. (b) Subgroup consistency measured via symmetric KL divergence on the Adult dataset after bias mitigation. The matrix displays pairwise consistency scores between demographic subgroups. A higher value indicates greater inconsistency. Notably, the (B, F) and (B, M) pair exhibits the highest pairwise inconsistency, suggesting substantial variation in model predictions between these two subgroups.
  • Figure 4: (a) Subgroup consistency measured via JS divergence on the BOLD dataset before bias mitigation. (b) Subgroup consistency measured via JS divergence on the BOLD dataset after bias mitigation. The matrix displays pairwise consistency scores between demographic subgroups. A higher value indicates greater inconsistency. Notably, the (B, F) and (B, M) pair exhibits the highest pairwise inconsistency, suggesting substantial variation in model predictions between these two subgroups.
  • Figure 5: Illustration of our LLM-based pipeline for converting the Adult (binary tabular) dataset into natural-language prompts, paraphrasing the queries, and predicting income labels. Each structured row (e.g., education, occupation, hours-per-week) is transformed into a contextual instruction-based prompt suitable for LLM reasoning. We then apply a controlled paraphrasing step to generate semantically equivalent variants without altering meaning, followed by a prediction prompt that asks the LLM to determine whether the individual earns more than $50$K per year. This process bridges tabular inputs and language-based classification, enabling fair and consistent evaluation of model behavior across rephrased prompts.
  • ...and 3 more figures
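The pairwise matrices described in Figures 3 and 4 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: it assumes each subgroup is summarized by a discrete distribution over output clusters, defines symmetric KL as the sum of the two directed KL divergences (the page does not say whether the paper sums or averages them), and adds a small smoothing constant `eps` so that empty clusters do not produce infinite values. The subgroup keys and probabilities are made up for the example.

```python
import math
from itertools import combinations


def smooth(p, eps=1e-9):
    """Add eps to every entry and renormalize so no probability is exactly zero."""
    total = sum(x + eps for x in p)
    return [(x + eps) / total for x in p]


def kl(p, q):
    """Directed KL divergence KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q, eps=1e-9):
    """KL(p || q) + KL(q || p) on smoothed inputs (sum form, an assumption)."""
    p, q = smooth(p, eps), smooth(q, eps)
    return kl(p, q) + kl(q, p)


def pairwise_divergence(dists, divergence=symmetric_kl):
    """Symmetric matrix (dict of dicts) of divergences between all subgroup pairs."""
    groups = sorted(dists)
    matrix = {g: {h: 0.0 for h in groups} for g in groups}
    for g, h in combinations(groups, 2):
        d = divergence(dists[g], dists[h])
        matrix[g][h] = matrix[h][g] = d
    return matrix


# Hypothetical cluster distributions for three subgroups.
dists = {
    "(W, M)": [0.6, 0.3, 0.1],
    "(B, M)": [0.3, 0.4, 0.3],
    "(B, F)": [0.1, 0.3, 0.6],
}
matrix = pairwise_divergence(dists)
worst = max(combinations(sorted(dists), 2), key=lambda pair: matrix[pair[0]][pair[1]])
print("most inconsistent pair:", worst)
```

Running the same function on distributions collected before and after mitigation gives two matrices that can be compared cell by cell, which is the comparison panels (a) and (b) present.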

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3