Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs

Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo

TL;DR

This work tackles bias in large language models by addressing shortcomings of human-feedback-heavy RLHF. It introduces Reinforcement Learning from Multi-role Debates as Feedback (RLDF), which builds a bias-aware dataset from structured multi-role debates, trains a reward model from high-bias versus low-bias pairs, and uses PPO to iteratively improve bias mitigation without human labeling. The approach comprises self-reflection and teacher-student modes, with comprehensive experiments across multiple LLMs and bias types showing enhanced bias reduction while maintaining response quality. The results demonstrate RLDF’s scalability and potential to generalize bias mitigation beyond specific prompts, offering a practical pathway for safer LLM deployment.

Abstract

Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We use LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM such as GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs on BBQ and our datasets demonstrate the effectiveness of our approach in bias mitigation. Our source code and datasets are available at https://anonymous.4open.science/r/RLDF-E344.
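
As an illustration of how high-bias and low-bias pairs could drive reward-model training, the sketch below computes a pairwise preference loss over scalar reward scores. The abstract does not state which loss RLDF uses; the Bradley-Terry form shown here is the one commonly used in RLHF and is assumed only for illustration. The function name `pairwise_reward_loss` and its arguments are hypothetical.

```python
import math


def pairwise_reward_loss(low_bias_scores, high_bias_scores):
    """Mean pairwise loss -log(sigmoid(r_low - r_high)) over paired scores.

    Assumption: each pair holds reward-model scores for a low-bias and a
    high-bias response to the same prompt, and the low-bias response is
    the preferred one. This is not confirmed as the paper's exact loss.
    """
    if len(low_bias_scores) != len(high_bias_scores):
        raise ValueError("score lists must have the same length")
    if not low_bias_scores:
        raise ValueError("need at least one pair")

    total = 0.0
    for r_low, r_high in zip(low_bias_scores, high_bias_scores):
        margin = r_low - r_high
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(low_bias_scores)
```

Under this assumed loss, the value shrinks as the reward model scores the low-bias response further above its high-bias counterpart, which is the signal a PPO stage would then optimize against.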

Paper Structure (49 sections, 2 equations, 13 figures, 14 tables)

This paper contains 49 sections, 2 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: When GPT-3.5-turbo and GPT-2 are asked about bias in text they generated, using the prompts "Here is our Q&A", "Here is the Q&A between me and a language model", and "Here is the Q&A between me and a language model competing with you", the number of identified biases increases gradually across the three prompts. When informed that the content was generated by itself, the LLM admits to far fewer biased responses than with the other prompts.
  • Figure 2: Visualization of the attention scores of the GPT-2 model on the three prompts above.
  • Figure 3: Overview of RLDF. The process starts with creating datasets from multi-role debates. Data pairs of both high and low bias are generated in two modes: (1) self-reflection mode, where the LLM generates the data itself, and (2) teacher-student mode, where a more advanced LLM helps the original one create more logical and less biased content. Low-bias data is labeled and used for supervised fine-tuning (SFT) of the LLM. Data pairs are then used to train a reward model (RW). Finally, the fine-tuned model generates new data, which is evaluated by the reward model. Using Proximal Policy Optimization (PPO), the model is continuously improved to produce less biased outputs.
  • Figure 4: An instance where three roles from different age groups discuss gardening as a hobby. Age biases that the LLM exhibits when portraying specific roles are highlighted. After each dialogue round, the LLM serves as a referee to identify biases, engaging in self-reflection.
  • Figure 5: Comparison with baselines across different LLMs in CE.
  • ...and 8 more figures