Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs

Jasmin Wachter, Michael Radloff, Maja Smolej, Katharina Kinder-Kurlanda

TL;DR

This work introduces an Item Response Theory (IRT) framework to quantify perceived socio-economic bias in LLMs without relying on human judgments. A 105-item inventory probes economic and social ideology, and a two-stage IRT model separates response avoidance from actual ideological bias: Stage 1 estimates Prefer Not to Answer (PNA) behavior with a 2PL model, and Stage 2 estimates bias with a Generalized Partial Credit Model (GPCM) on the answered items. The authors empirically calibrate the method using fine-tuned LLMs (Meta LLaMa-3.2-1B-Instruct and ChatGPT-3.5) and demonstrate that off-the-shelf models often avoid ideological engagement rather than expressing bias, challenging prior claims of partisanship. The framework supports scalable, standardized bias benchmarking for AI governance and fair alignment, with implications for distinguishing bias from alignment and for guiding targeted benchmarking.

Abstract

We introduce an Item Response Theory (IRT)-based framework to detect and quantify socioeconomic bias in large language models (LLMs) without relying on subjective human judgments. Unlike traditional methods, IRT accounts for item difficulty, improving ideological bias estimation. We fine-tune two LLM families (Meta-LLaMa 3.2-1B-Instruct and ChatGPT 3.5) to represent distinct ideological positions and introduce a two-stage approach: (1) modeling response avoidance and (2) estimating perceived bias in answered responses. Our results show that off-the-shelf LLMs often avoid ideological engagement rather than exhibit bias, challenging prior claims of partisanship. This empirically validated framework enhances AI alignment research and promotes fairer AI governance.
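To make the two-stage idea concrete, the following is a minimal sketch of the textbook forms of the two response models named in the summary: a 2PL model for the probability that an item is flagged Prefer Not to Answer (PNA), and a Generalized Partial Credit Model (GPCM) for the ordered response categories of answered items. The symbols $\theta$, $\alpha_i$, and $\beta_i$ follow the figure captions; the function names, the `thresholds` argument, and the sign convention (a higher $\theta$ meaning a higher chance of PNA) are assumptions for illustration, not the authors' implementation.

```python
import math


def pna_probability(theta, alpha, beta):
    """Standard 2PL form: P(item flagged PNA) = sigmoid(alpha * (theta - beta)).

    theta: latent trait of the model being evaluated (assumed: avoidance tendency)
    alpha: item discrimination, beta: item difficulty
    """
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))


def gpcm_probabilities(theta, alpha, thresholds):
    """Standard GPCM form for one item with len(thresholds) + 1 ordered categories.

    The exponent of category k is the running sum of alpha * (theta - b_j)
    over the first k step thresholds b_j; category 0 has exponent 0.
    """
    exponents = [0.0]
    for b in thresholds:
        exponents.append(exponents[-1] + alpha * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Stage 1: how likely is a refusal on an item of average difficulty?
p_refuse = pna_probability(theta=0.0, alpha=1.0, beta=0.0)

# Stage 2: category probabilities for an answered three-category item.
category_probs = gpcm_probabilities(theta=0.0, alpha=1.0, thresholds=[-1.0, 1.0])
```

In this sketch the two stages use separate calls because the approach described above estimates avoidance and ideology separately: refusals inform only the 2PL stage, and only answered items enter the GPCM stage.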

Paper Structure

This paper contains 73 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Evaluation of Response Avoidance of the Tiny-LLaMa lightweight model family: (a) Proportion of PNA-flagged answers per run; (b) Alignment Score $\theta$.
  • Figure 2: Evaluation of Response Avoidance of the GPT model family: (a) Proportion of PNA-flagged answers per run; (b) Alignment Score $\theta$.
  • Figure 3: Evaluation of Bias in the GPT and LLaMa Model Families: Comparison of Ideology Score $\theta$.
  • Figure 4: Evaluation of Response Avoidance (PNA): Item discrimination scores $\alpha_i$ for the 2PL-Model.
  • Figure 5: Evaluation of Response Avoidance (PNA): Item difficulties $\beta_i$ for the 2PL-Model modeling Answer Refusal of LLMs.
  • ...and 2 more figures