DeFrame: Debiasing Large Language Models Against Framing Effects

Kahee Lim, Soyeon Kim, Steven Euijong Whang

TL;DR

The paper reframes LLM fairness by introducing framing disparity, $FD(M_\theta; P^+, P^-)$, to quantify how bias measurements shift with positive versus negative framings. It shows that existing debiasing methods reduce overall bias but often fail to stabilize bias across framings. To address this, it proposes DeFrame, a framing-aware, System 2–like debiasing framework with Framing Integration, Guideline Generation, and Self-Revision steps, which reduces both $|FD|$ and bias on BBQ, DoNotAnswer-Framed, and 70Decisions-Framed across multiple models. The results underscore the need for framing-aware fairness methods and demonstrate that robust, prompt-based debiasing can yield fairer, more consistent LLM behavior in realistic, framing-rich settings.

Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

Paper Structure (33 sections, 6 equations, 12 figures, 15 tables)

This paper contains 33 sections, 6 equations, 12 figures, 15 tables.

Figures (12)

  • Figure 1: (a) An example of the framing effect using a gender stereotype. The responses of LLMs can show different bias levels when the same stereotype is framed differently; see evaluation on 8 LLMs in Sec. \ref{sec:framing_disparity_evaluation}. (b) The overall process of our DeFrame framework (Sec. \ref{sec:debiasing_method}) on the BBQ benchmark. We rephrase the input prompts with alternative framings, generate fairness guidelines, and revise the initial responses of LLMs to produce more consistent and fair responses. The example shown is an actual debiasing process on Qwen2.5-3b-Instruct.
  • Figure 2: Bias levels and framing disparities (FD) across baselines on the three benchmarks. We report the average absolute values of each metric across 8 LLMs to capture their overall magnitude (see Appendix \ref{appen:full_experimental_result_baselines} for full model-wise results). (Left) Bias score and framing disparity on BBQ. (Middle) Harmful response rate (HRR) and framing disparity on DoNotAnswer-Framed. (Right) Discrimination score and framing disparity on 70Decisions-Framed. Across the three benchmarks, DeFrame generally achieves the lowest bias level and framing disparity.
  • Figure 3: Full accuracy results on the BBQ benchmark across 7 demographic categories, covering all baselines and our proposed method, DeFrame. This table presents results for LLaMA3.2-3b-instruct, LLaMA3.1-8b-instruct, Qwen2.5-3b-instruct, and Qwen2.5-7b-instruct.
  • Figure 4: Full accuracy results on the BBQ benchmark across 7 demographic categories, covering all baselines and our proposed method, DeFrame. This table presents results for Qwen2.5-14b-instruct, Gemma3-4b-instruct, Gemma3-12b-instruct, and Mistral-7b-instruct.
  • Figure 5: Full bias score results on the BBQ benchmark across 7 demographic categories and 8 models, covering all baselines and our proposed method, DeFrame. This table presents results for LLaMA3.2-3b-instruct, LLaMA3.1-8b-instruct, Qwen2.5-3b-instruct, and Qwen2.5-7b-instruct.
  • ...and 7 more figures

Theorems & Definitions (1)

  • Definition 1: Framing Disparity
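
Definition 1 is listed here by name only, so the following is a minimal sketch rather than the paper's formula. It assumes that framing disparity is the signed difference between a bias metric measured on positively framed prompts ($P^+$) and on negatively framed prompts ($P^-$), and that "overall" bias is the average over the two framings, as the abstract's "frame-averaged" wording suggests. The sign convention, the function names, and the sample scores are all hypothetical.

```python
from statistics import mean


def framing_disparity(bias_pos: float, bias_neg: float) -> float:
    """Signed gap between bias under positive and negative framing.

    Assumption: FD = bias(P+) - bias(P-). The paper's Definition 1 may use
    the opposite sign or a different normalization.
    """
    return bias_pos - bias_neg


def frame_averaged_bias(bias_pos: float, bias_neg: float) -> float:
    """Overall bias, taken here as the mean over the two framings."""
    return (bias_pos + bias_neg) / 2


def summarize(per_model: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Average |overall bias| and |FD| across models.

    Mirrors the Figure 2 caption, which reports average absolute values of
    each metric across models. `per_model` maps a model name to a
    (bias_pos, bias_neg) pair.
    """
    overall = mean(abs(frame_averaged_bias(p, n)) for p, n in per_model.values())
    disparity = mean(abs(framing_disparity(p, n)) for p, n in per_model.values())
    return overall, disparity


# Hypothetical scores, not taken from the paper.
scores = {
    "model_a": (0.125, 0.375),
    "model_b": (0.25, 0.0),
}
overall, disparity = summarize(scores)
print(f"mean |bias| = {overall}, mean |FD| = {disparity}")
```

Under this reading, a debiasing method can lower the frame-averaged bias while leaving $|FD|$ unchanged, for example by shifting both framings down by the same amount. That is the gap the abstract describes.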