Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment

Eva Paraschou; Line Harder Clemmensen; Sneha Das

TL;DR

Improving fairness along one sensitive attribute can inadvertently worsen disparities along others under uncertainty, which highlights the need for context-aware, multi-attribute fairness evaluation frameworks.

Abstract

Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguous contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p< 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
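The contrast the abstract draws between aggregate and context-aware results can be illustrated with a minimal sketch. The records below are hypothetical toy values chosen for illustration only; they are not results from the paper, and the helper name `accuracy_by` is an assumption rather than the authors' code. The sketch only shows how an overall accuracy gain can coexist with a drop in ambiguous contexts.

```python
from collections import defaultdict

# Hypothetical records: (attribute, context, phase, is_correct).
# Toy values for illustration only; they are NOT results from the paper.
records = (
    [("physical_appearance", "ambiguous", "pre", True)] * 8
    + [("physical_appearance", "ambiguous", "pre", False)] * 2
    + [("physical_appearance", "ambiguous", "post", True)] * 6
    + [("physical_appearance", "ambiguous", "post", False)] * 4
    + [("physical_appearance", "disambiguous", "pre", True)] * 5
    + [("physical_appearance", "disambiguous", "pre", False)] * 5
    + [("physical_appearance", "disambiguous", "post", True)] * 9
    + [("physical_appearance", "disambiguous", "post", False)] * 1
)


def accuracy_by(records, key):
    """Return accuracy per group, where `key` maps a record to its group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        hits[group] += int(rec[3])  # rec[3] is the correctness flag
    return {group: hits[group] / totals[group] for group in totals}


# Context-unaware view: group only by phase (pre- vs. post-alignment).
aggregate = accuracy_by(records, key=lambda r: r[2])

# Context-aware view: group by (context, phase).
by_context = accuracy_by(records, key=lambda r: (r[1], r[2]))

print("aggregate:", aggregate)
print("by context:", by_context)
```

In this toy data the aggregate accuracy rises after alignment while the ambiguous-context accuracy falls, which is the pattern a context-unaware evaluation would miss.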

Paper Structure (30 sections, 1 equation, 5 figures, 13 tables)

Introduction
Related Work
Sensitive attribute dynamics in ML and DL
LLM Alignment and the BBQ benchmark
Value interactions in LLM Alignment
Methodology
Large language models
Alignment algorithm: DPO
Benchmark
The BBQ benchmark
Training vs. Evaluation sets
Pre-processing for DPO
Metrics
Alignment Accuracy
Statistical Significance
...and 15 more sections

Figures (5)

Figure 1: Samples of the four possible combinations of the contexts (disambiguous and ambiguous) and polarities (non-negative and negative) of the BBQ benchmark.
Figure 2: Comparative analysis of the context-unaware results of Mistral 7B (orange), Llama 3.1 8B (blue), and Qwen 2.5 7B (purple) across all sensitive attributes. Light shades show pre-alignment accuracy and dark shades show post-alignment accuracy. The number of asterisks represents the significance level of the difference.
Figure 3: The two new alignment questions derived from transforming an original alignment question (see \ref{sfig:ambnonneg}).
Figure 4: Comparative analysis of the ambiguous questions of Mistral 7B (orange), Llama 3.1 8B (blue), and Qwen 2.5 7B (purple) across all sensitive attributes. Light shades show pre-alignment accuracy and dark shades show post-alignment accuracy. The number of asterisks represents the significance level of the difference.
Figure 5: Comparative analysis of the disambiguous questions of Mistral 7B (orange), Llama 3.1 8B (blue), and Qwen 2.5 7B (purple) across all sensitive attributes. Light shades show pre-alignment accuracy and dark shades show post-alignment accuracy. The number of asterisks represents the significance level of the difference.
