Prompting Fairness: Integrating Causality to Debias Large Language Models

Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu

TL;DR

This work targets social biases in large language models by introducing a causality-guided debiasing framework that treats data generation and model reasoning as causal processes. It proposes three prompting strategies—nudging toward social-agnostic facts, counteracting historical biases, and nudging away from social-salient text—enabled by selection mechanisms to regulate information flow. Empirically, the approach yields significant bias reductions on WinoBias and BBQ across multiple models, including black-box access scenarios, with the strongest results when all strategies are combined. The framework thus offers theoretically grounded, practically applicable prompting guidelines for debiasing LLMs in high-stakes settings and opens avenues for reward-modeling and broader fairness research.

Abstract

Large language models (LLMs), despite their remarkable capabilities, are susceptible to generating biased and discriminatory responses. As LLMs increasingly influence high-stakes decision-making (e.g., hiring and healthcare), mitigating these biases becomes critical. In this work, we propose a causality-guided debiasing framework to tackle social biases, aiming to reduce the objectionable dependence between LLMs' decisions and the social information in the input. Our framework introduces a novel perspective to identify how social information can affect an LLM's decision through different causal pathways. Leveraging these causal insights, we outline principled prompting strategies that regulate these pathways through selection mechanisms. This framework not only unifies existing prompting-based debiasing techniques, but also opens up new directions for reducing bias by encouraging the model to prioritize fact-based reasoning over reliance on biased social cues. We validate our framework through extensive experiments on real-world datasets across multiple domains, demonstrating its effectiveness in debiasing LLM decisions, even with only black-box access to the model.
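As a rough illustration of how the three prompting strategies might be combined with only black-box access, the sketch below prepends one instruction per strategy to a question. The instruction wording, the `STRATEGY_PROMPTS` dictionary, and the `build_debiasing_prompt` function are hypothetical and are not taken from the paper; only the three strategy names come from the summary above.

```python
# Hypothetical instruction text for each strategy (illustrative wording only,
# not the prompts used by the authors).
STRATEGY_PROMPTS = {
    "nudge_toward_facts": (
        "Base your answer only on facts in the question that do not depend "
        "on anyone's social category."
    ),
    "counteract_historical_bias": (
        "Historical data may encode stereotypes; do not assume that a social "
        "category determines a role, trait, or outcome."
    ),
    "nudge_away_from_social_text": (
        "Disregard wording that signals a social category unless it is "
        "strictly required to answer."
    ),
}


def build_debiasing_prompt(question, strategies=tuple(STRATEGY_PROMPTS)):
    """Prepend the selected strategy instructions to a question.

    By default all three strategies are applied, in dictionary order.
    """
    unknown = [name for name in strategies if name not in STRATEGY_PROMPTS]
    if unknown:
        raise KeyError(f"Unknown strategies: {unknown}")
    parts = [STRATEGY_PROMPTS[name] for name in strategies]
    parts.append(f"Question: {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_debiasing_prompt("Who does 'she' refer to in the sentence?"))
```

Passing a subset such as `strategies=["nudge_toward_facts"]` makes it possible to compare individual strategies against the combined prompt, which the summary reports as the strongest setting.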

Paper Structure (48 sections, 1 theorem, 4 equations, 10 figures, 7 tables)

Key Result

Theorem 3.1

If the conditions and constraints specified in the equations for nudging towards facts (eq:nudging_towards_fact), counteracting existing historical discrimination (eq:counteract_historical_discrimination:counteract_existing), and nudging away from social-aware text (eq:nudging_away_from_awaretext) are simultaneously satisfied, then the LLM's decision $Y$ is independent of the social category $A$ in the presence of PPC.

Figures (10)

  • Figure 1: Panel (a): A biased answer may result from a gender shortcut, whereas a fact-based answer draws on proper world knowledge given the circumstances. Panel (b): How to systematically generate prompts that encourage fact-based reasoning. Note that using social category information does not necessarily make the reasoning biased: sometimes such information should be considered, e.g., gender in medical treatments. We call such dependence neutral dependence; this work focuses on objectionable (problematic) dependence [tang2024procedural].
  • Figure 2: An illustrative example of selection mechanisms.
  • Figure 3: Causal graphs for different data generating processes. Double-stroke contours indicate selection variables, solid edges represent causal relations among observed variables or internal representations, and dashed edges represent relations pertaining to selection mechanisms. Panel (a) presents the underlying data generating process of the training data corpus. Panel (b) presents a causal perspective on how an LLM's decisions are related to internal representations and modulated by external prompts. Pathways along which social category information can influence the LLM's decision are highlighted in light coral.
  • Figure 4: Additional illustrations on debiasing strategies.
  • Figure 5: Performance comparison on Discrim-Eval across three demographics. Each bar shows the degree of discrimination, measured by comparing the least privileged group with the most privileged group in a given demographic category (the higher the bar, the greater the discrimination). Methods (prompt designs) are distinguished by color, with lighter colors denoting those that amplify fact-based reasoning. Encouraging fact-based reasoning consistently decreases the relative gap when combined with methods that reduce biased reasoning.
  • ...and 5 more figures
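
The group comparison described in the Figure 5 caption can be sketched as a gap between the most and least favored groups. This is only one plausible reading: the exact discrimination score used in the paper is not given here, and the function name `discrimination_gap`, the input format, and the example rates are all hypothetical.

```python
def discrimination_gap(favorable_rate_by_group):
    """Return (least_favored, most_favored, gap) for one demographic category.

    `favorable_rate_by_group` maps a group name to the fraction of favorable
    decisions the model gave that group. The gap is the difference between the
    highest and lowest rate, so 0.0 means no measured disparity.
    This definition is an assumption made for illustration.
    """
    if not favorable_rate_by_group:
        raise ValueError("At least one group is required.")
    most = max(favorable_rate_by_group, key=favorable_rate_by_group.get)
    least = min(favorable_rate_by_group, key=favorable_rate_by_group.get)
    gap = favorable_rate_by_group[most] - favorable_rate_by_group[least]
    return least, most, gap


if __name__ == "__main__":
    # Made-up rates, for demonstration only.
    rates = {"group_a": 0.80, "group_b": 0.60, "group_c": 0.70}
    print(discrimination_gap(rates))
```

Computing this gap once per prompt design would give the kind of per-method comparison the caption describes, with a smaller gap indicating less disparity between groups.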

Theorems & Definitions (3)

  • Theorem 3.1: Comprehensive Debiasing When Combining All Three Strategies
  • Remark 3.2
  • Proof