Mitigating Political Bias in Language Models Through Reinforced Calibration

Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, Soroush Vosoughi

TL;DR

The paper addresses political bias in text generated by large language models. It introduces two metrics (Indirect Bias and Direct Bias) and a reinforcement-learning framework for calibrated debiasing that requires neither access to the training data nor retraining of the model. The framework has two modes, Word Embedding Debias and Classifier Guided Debias, driven by PPO-inspired rewards and KL regularization, and is evaluated on gender, location, and topic attributes using the Media Cloud dataset. Empirical results show significant reductions in both indirect and direct bias, and human judgments confirm that readability and coherence are maintained; classifier-guided debiasing generally performs best, with some loss of fluency. Because it avoids costly retraining, the approach is a practical way to reduce political bias in pretrained LMs across a wide range of models and deployments.

Abstract

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.
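
As a rough illustration of how a debiasing reward can be combined with KL regularization, the sketch below subtracts a KL penalty from a scalar reward so that the calibrated policy stays close to the original language model. It is a minimal sketch and not the paper's implementation: the function names, the `beta` weight, and the toy next-token distributions are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Assumes q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_reward(debias_reward, policy_probs, reference_probs, beta=0.1):
    """Debias reward minus a KL penalty (hypothetical helper, not from the paper).

    debias_reward:   scalar score, e.g. from word embeddings or a classifier
    policy_probs:    next-token distribution of the calibrated model
    reference_probs: next-token distribution of the original model
    beta:            assumed weight of the KL penalty
    """
    penalty = beta * kl_divergence(policy_probs, reference_probs)
    return debias_reward - penalty

# Toy distributions over a two-token vocabulary (illustrative values only).
reference = [0.9, 0.1]
policy = [0.5, 0.5]
reward = regularized_reward(1.0, policy, reference)
```

With identical policy and reference distributions the penalty is zero and the reward passes through unchanged; the further the policy drifts from the reference, the more of the reward is taken away.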

Paper Structure

This paper contains 34 sections, 9 equations, 2 figures, 9 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Two modes of our RL-guided debias method.
  • Figure 2: (a) and (b): UMAP 2D visualizations of 5,606 sentences generated by vanilla GPT-2, where the sentence embeddings are the encoded outputs of (a) an XLNet that was not pretrained and (b) an XLNet pretrained on the Media Cloud dataset ($F_1 = 0.98$). (c) and (d): visualizations of sentences debiased by Mode 1 and Mode 2, respectively; the embeddings in both come from the pretrained XLNet. Each sentence is marked with the class (L / C) assigned by the pretrained XLNet classifier.