Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki

Abstract

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while this fine-tuning reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

Paper Structure (46 sections, 7 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Distribution of generated genders (female/male/neutral) and entity-level $bias(e)$ score for Llama on a gendered concept, Professions (top), and a neutral concept, Diseases (bottom), before fine-tuning (left), after fine-tuning (middle), and after fine-tuning with a jailbreak instruction (right).
  • Figure 2: Concept-level polarization score $Bias_{pol}(c)$ for the 6 concepts studied, across 3 models (Gemma, Llama, and Mistral, from left to right) and 3 conditions (before fine-tuning, after fine-tuning, and after fine-tuning with a jailbreak instruction).
  • Figure 3: Entity-level latent gender score $s^{20}(e)$ for Llama, before and after fine-tuning for the concepts Professions (top) and Diseases (bottom).
  • Figure 4: Latent polarization score $\text{S}^{l}_{\text{latent}}(c)$ per concept across layers for Llama (before and after fine-tuning), compared to concept-specific random reference distributions (shaded areas indicate the 2.5%-97.5% quantile interval).
  • Figure 5: Spearman correlation between expressed bias and latent gender scores by layer, for the Professions and Diseases concepts in Llama.
  • ...and 11 more figures