Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis

David F. Jenny, Yann Billeter, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin

TL;DR

This work investigates the internal causes of bias in large language models by adopting a causal fairness framework. It defines a Standard Fairness Model with protected attributes, mediators, and confounders, and demonstrates that LLM bias in political argument evaluation emerges through complex, indirect pathways rather than simple direct discrimination. The authors introduce a prompt-based attribute extraction pipeline and use Activity Dependency Networks to non-parametrically map interactions among extracted attributes and outcomes, validating findings with attribute perturbations and bootstrap analyses. The study highlights the limitations of direct fine-tuning for debiasing and argues for causal attribution-guided mitigation, with implications for alignment, transparency, and responsible AI deployment in high-stakes political discourse.

Abstract

The rapid advancement of Large Language Models (LLMs) has sparked intense debate regarding the prevalence of bias in these models and its mitigation. Yet, as exemplified by both results on debiasing methods in the literature and reports of alignment-related defects from the wider community, bias remains a poorly understood topic despite its practical relevance. To enhance the understanding of the internal causes of bias, we analyse LLM bias through the lens of causal fairness analysis, which enables us to both comprehend the origins of bias and reason about its downstream consequences and mitigation. To operationalize this framework, we propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the LLM decision process. By applying Activity Dependency Networks (ADNs), we then analyse how these attributes influence an LLM's decision process. We apply our method to LLM ratings of argument quality in political debates. We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment, and discuss the consequences of our findings for human-AI alignment and bias mitigation. Our code and data are at https://github.com/david-jenny/LLM-Political-Study.

Paper Structure

This paper contains 61 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: (Undesired) Effect of Bias Treatment on Decision Process: The figure depicts how the LLM's perception of value $\textit{A}$ enters the decision process when judging $\textit{B}$ and $\textit{C}$, through $f(B|A)$ and $f(C|A)$. Now suppose the association of value $\textit{A}$ with $\textit{C}$ ($f(C|A)$) is treated by naively fine-tuning the model to align with this value of interest, and consider the effect on other value associations ($f(B|A)$) that are not actively considered: they may be changed indiscriminately, regardless of whether they were already aligned. These associations are currently neither observable nor predictable, yet changes in them are potentially harmful. Using the extracted decision processes, we gain information on which areas are prone to such unwanted changes.
  • Figure 2: Paper Overview: We start by processing the input data, followed by extracting normative values from ChatGPT and a subsequent analysis of the causal structures within the data. We then use the resulting causal networks to reason about bias attribution and the problems with bias mitigation via direct fine-tuning.
  • Figure 3: A graphical model of the standard fairness model.
  • Figure 4: Example of Extracted Correlations: Correlations of $\textit{Speaker Party}$, $\textit{Score}$ and the measurement types of $\textit{Score}$ and $\textit{Academic Score}$ plotted against an example subset of the attributes. This plot aims to give an example of the dataset and to demonstrate the sensitivity of the correlations to the exact definitions. See \ref{app:pol_extra_plots} for further plots.
  • Figure 5: Distributions of scores assigned by LLM for different definitions. The attribute definitions are given in \ref{app:all_variables}.
  • ...and 6 more figures
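
The correlations in Figure 4 and the bootstrap analyses mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' pipeline: the data below is synthetic, the variable names `speaker_party` and `score` are only stand-ins for the attributes named in the caption, and the helper `bootstrap_correlation` is hypothetical. It shows one common way to attach a percentile bootstrap interval to a Pearson correlation, assuming NumPy is available.

```python
import numpy as np


def bootstrap_correlation(x, y, n_resamples=1000, seed=0):
    """Pearson correlation of x and y with a 95% percentile bootstrap interval.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)

    # Point estimate on the full sample.
    point = np.corrcoef(x, y)[0, 1]

    # Recompute the correlation on rows resampled with replacement.
    samples = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        samples[b] = np.corrcoef(x[idx], y[idx])[0, 1]

    low, high = np.percentile(samples, [2.5, 97.5])
    return point, low, high


# Synthetic stand-in data (an assumption, not the study's dataset):
# a binary protected attribute and a score that depends weakly on it.
data_rng = np.random.default_rng(42)
speaker_party = data_rng.integers(0, 2, size=500)
score = 0.5 * speaker_party + data_rng.normal(size=500)

point, low, high = bootstrap_correlation(speaker_party, score)
print(f"correlation = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

A wide interval signals that a correlation is not stable under resampling, which is one reason to check estimates this way before reading them as evidence of bias. A raw correlation with the protected attribute also does not separate direct effects from those passing through mediators or confounders, which is the distinction the paper's dependency analysis targets.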