LLM Bias Detection and Mitigation through the Lens of Desired Distributions

Ingroj Shrestha, Padmini Srinivasan

TL;DR

This work reframes bias in LLMs as deviation from a target distribution, proposing distribution-aligned fine-tuning via a weighted adaptive KL loss to match either an equal or real-world gender–profession distribution while preserving language modeling. The methodology combines detection using KL divergence between target and predicted distributions with a flexible mitigation objective, including a uniform KL loss and an adaptive variant that adjusts updates by profession group dynamics and stability. Empirical results across masked and autoregressive models show near-complete bias removal under equality and substantial reductions under real-world distributions, with MLM loss and downstream GLUE/LMEH performance largely preserved. The approach supports factual grounding and reduces hallucinations in high-stakes contexts, offering a practical, tunable framework for distribution-aware debiasing across multiple model families and sizes.

Abstract

Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLMs' outputs with desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or a real-world distribution, depending on application goals. We propose a fine-tuning method based on a weighted adaptive loss that aligns an LLM's gender-profession output distribution with the desired distribution while preserving language modeling capability. Using three profession sets (male-dominated, female-dominated, and gender-balanced) derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and a 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.
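
To make the idea of "bias as deviation from a desired distribution" concrete, the following is a minimal sketch of scoring one profession's predicted gender distribution against two candidate targets with KL divergence. It is not the authors' implementation: the function names, the direction KL(target || predicted), the example probabilities, and the "real-world" target values are all illustrative assumptions rather than figures from the paper.

```python
import math


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for discrete distributions given as dicts.

    The direction (target first) is an assumption made for this sketch.
    """
    total = 0.0
    for key, p in target.items():
        if p == 0.0:
            continue  # a zero-probability target term contributes nothing
        q = max(predicted.get(key, 0.0), eps)  # guard against log of zero
        total += p * math.log(p / q)
    return total


# Hypothetical model probabilities for gendered fillers of one profession
# template; renormalized over the two gender groups.
predicted = normalize({"male": 0.02, "female": 0.06})

# Two possible desired distributions for the same profession.
equal_target = {"male": 0.5, "female": 0.5}
real_world_target = {"male": 0.13, "female": 0.87}  # made-up values

bias_equal = kl_divergence(equal_target, predicted)
bias_real_world = kl_divergence(real_world_target, predicted)

print(f"bias vs. equal target:      {bias_equal:.4f}")
print(f"bias vs. real-world target: {bias_real_world:.4f}")
```

In this toy case the same prediction sits further from the equal target than from the real-world one, which illustrates why the measured bias depends on which distribution the application treats as desired.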

Paper Structure

This paper contains 31 sections, 12 equations, 14 tables.