Mitigating Political Bias in Language Models Through Reinforced Calibration
Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, Soroush Vosoughi
TL;DR
The paper addresses political bias in text generated by large language models. It introduces two metrics, Indirect Bias and Direct Bias, and a reinforcement learning framework for calibrated debiasing that requires neither access to the training data nor retraining of the model. It presents two debiasing modes, Word Embedding Debias and Classifier Guided Debias, both driven by rewards inspired by Proximal Policy Optimization (PPO) and by KL regularization, and evaluates them on gender, location, and topic attributes using the Media Cloud dataset. Empirical results show significant reductions in both indirect and direct bias, and human judgments confirm that readability and coherence are maintained; classifier-guided debiasing generally performs best, albeit with some fluency trade-offs. The approach offers a practical way to reduce political bias in pretrained LMs without costly retraining, which makes it applicable to a wide range of models and deployments.
Abstract
Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.
