BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He

TL;DR

BiasFreeBench addresses the lack of standardized, realistic evaluation for bias mitigation in large language models by measuring bias at the response level in two QA-style scenarios (BBQ and FairMT-Bench). It evaluates eight debiasing techniques (four prompting-based and four training-based) with a new metric, the Bias-Free Score, across seven LLMs of varying sizes. The study finds that prompting-based methods, particularly Chain-of-Thought prompting, consistently outperform training-based approaches, that larger models make prompting more effective, and that certain training methods (such as DPO) generalize across bias types. The benchmark provides a unified testbed for fair, safe, and anti-stereotypical LLM responses and offers practical guidance for deploying bias-mitigated systems in real-world settings.

Abstract

Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs' probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
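
The abstract describes the Bias-Free Score only in words, so the following is a minimal sketch rather than the authors' definition. It assumes the score is the percentage of responses that a judge has labeled fair, safe, or anti-stereotypical; the label strings and the names `BIAS_FREE_LABELS`, `bias_free_score`, and `delta_bfs` are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical label set: the source names these three qualities but does not
# specify how individual responses are annotated.
BIAS_FREE_LABELS = {"fair", "safe", "anti-stereotypical"}


def bias_free_score(labels):
    """Return the percentage of responses whose label counts as bias-free.

    Assumption: the score is a simple proportion over all labeled responses.
    """
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    bias_free = sum(counts[label] for label in BIAS_FREE_LABELS)
    return 100.0 * bias_free / len(labels)


def delta_bfs(debiased_labels, baseline_labels):
    """Change in score (percentage points) of a debiased model over a baseline."""
    return bias_free_score(debiased_labels) - bias_free_score(baseline_labels)


if __name__ == "__main__":
    baseline = ["biased", "biased", "fair", "safe"]
    debiased = ["fair", "biased", "safe", "anti-stereotypical"]
    print(bias_free_score(baseline))
    print(bias_free_score(debiased))
    print(delta_bfs(debiased, baseline))
```

Under this assumption, a negative `delta_bfs` would mean the mitigation technique made responses less bias-free than the unmodified model.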

Paper Structure

This paper contains 48 sections, 7 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: BiasFreeBench comprehensively compares prompting-based and training-based techniques to mitigate bias in LLM responses. They are evaluated on QA-based bias datasets with a response-level metric, Bias-Free Score.
  • Figure 2: Instructions for the prompting-based debiasing methods.
  • Figure 3: Four training-based bias mitigation techniques explored in BiasFreeBench.
  • Figure 4: Mean and standard deviation of BFS (%) across 4 prompting-based and 3 training-based methods on different sizes of Qwen2.5.
  • Figure 5: (a) Bias-Free Score (%) across 9 bias types on the BBQ dataset. (b) (c) $\Delta$BFS of SFT and DPO with single bias type training data. "[Bias Type] SFT/DPO" (e.g., Gender DPO) denotes training with data only from one specific bias type. "SFT/DPO" indicates training with data from all bias types. Areas with negative improvements are shaded in grey.
  • ...and 12 more figures