On the Relationship between Truth and Political Bias in Language Models

Suyash Fulay, William Brannon, Shrestha Mohanty, Cassandra Overney, Elinor Poole-Dayan, Deb Roy, Jad Kabbara

TL;DR

This paper investigates how pursuing truthfulness in language model alignment interacts with political bias. Evaluating existing open-source ("vanilla") reward models, which are trained on standard human preference datasets, on the paired political statements in TwinViews-13k, the authors find a left-leaning bias that is larger for larger models. They then train "truthful" reward models on several truthfulness datasets and find a similar left-leaning bias, even though these models are optimized for ostensibly objective truth. Explicit political content in these datasets is minimal and is not the primary driver; stylistic artifacts explain only part of the bias. The findings point to a potential tension between truthfulness and political neutrality in RLHF pipelines, and highlight data representativeness and model scale as key avenues for future research.

Abstract

Language model alignment research often attempts to ensure that models are not only helpful and harmless, but also truthful and unbiased. However, optimizing these objectives simultaneously can obscure how improving one aspect might impact the others. In this work, we focus on analyzing the relationship between two concepts essential in both language model alignment and political science: truthfulness and political bias. We train reward models on various popular truthfulness datasets and subsequently evaluate their political bias. Our findings reveal that optimizing reward models for truthfulness on these datasets tends to result in a left-leaning political bias. We also find that existing open-source reward models (i.e., those trained on standard human preference datasets) already show a similar bias and that the bias is larger for larger models. These results raise important questions about the datasets used to represent truthfulness, potential limitations of aligning models to be both truthful and politically unbiased, and what language models capture about the relationship between truth and politics.

Paper Structure (32 sections, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Vanilla open-source reward models have a clear left-leaning political bias. All three subplots show reward scores on the paired TwinViews political statements data, with histograms broken out for the left and right sides. Dashed vertical lines indicate each side's mean reward; a left political bias is indicated by a higher value for the blue line than the red line. The magnitude of the bias (difference in group means divided by pooled SD) is shown on each subplot. Note the presence of inverse scaling: Both model sizes and bias increase from left to right (although the training datasets/methods are different across the models).
  • Figure 2: "Truthful" reward models usually show a left-leaning political bias. The left three subplots show rewards assigned to TwinViews political statements by models fine-tuned on each truthfulness dataset, excluding explicitly political content found by our audit. We run five train/eval splits for each dataset and model. Individual points show results from each run, with blue points representing the average reward given to left-leaning statements and red points representing the average reward given to right-leaning statements. The red and blue bar heights show the average reward across all five runs (i.e., the average of the corresponding point values). Note the presence of inverse scaling: Larger models usually skew further left. Results of the n-gram experiment on stylistic artifacts appear in the rightmost pane, showing no clear relationship to the neural models' patterns.
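
The bias magnitude described in the Figure 1 caption (difference in group means divided by pooled SD) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual pooled standard deviation built from sample variances (as in Cohen's d), which the caption does not spell out, and the function names `pooled_sd` and `bias_magnitude` are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_sd(a, b):
    """Pooled standard deviation of two samples.

    Assumption: sample variances (n - 1 denominator) weighted by degrees
    of freedom, as in Cohen's d. The source only says "pooled SD".
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)


def bias_magnitude(left_rewards, right_rewards):
    """Difference in group means divided by pooled SD.

    Positive values mean left-leaning statements receive higher reward
    on average; negative values mean the opposite.
    """
    diff = mean(left_rewards) - mean(right_rewards)
    return diff / pooled_sd(left_rewards, right_rewards)
```

Here `left_rewards` and `right_rewards` would hold the reward scores a model assigns to the left-leaning and right-leaning statements of the paired data, each list needing at least two values. Dividing by the pooled SD puts the gap in standard-deviation units, which is what makes it comparable across reward models whose raw scores are on different scales.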