Aligning to What? Limits to RLHF Based Alignment

Logan Barnhart, Reza Akbarian Bafghi, Stephen Becker, Maziar Raissi

TL;DR

The paper interrogates whether Reinforcement Learning from Human Feedback (RLHF) truly aligns large language models with human values, focusing on covert biases against African Americans and examining methods such as Direct Preference Optimization (DPO), Odds Ratio Preference Optimization (ORPO), and REINFORCE Leave-One-Out (RLOO). Using matched-guise bias probing and multimodal extensions, the study finds that RLHF yields only marginal reductions in covert biases and can calcify biases when supervised fine-tuning precedes RLHF. It also shows that multimodal measurements can yield divergent patterns between covert and overt biases, suggesting current alignment techniques struggle with nebulous objectives like harmlessness and bias mitigation. The results advocate for higher-quality, diverse datasets and improved alignment tools to meaningfully address subtle social biases in AI systems.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models (LLMs) with human preferences. However, the effectiveness of RLHF in addressing underlying biases remains unclear. This study investigates the relationship between RLHF and both covert and overt biases in LLMs, particularly focusing on biases against African Americans. We applied various RLHF techniques (DPO, ORPO, and RLOO) to Llama 3 8B and evaluated the covert and overt biases of the resulting models using matched-guise probing and explicit bias testing. We performed additional tests with DPO on different base models and datasets; among several implications, we found that SFT before RLHF calcifies model biases. Additionally, we extend the tools for measuring biases to multi-modal models. Through our experiments we collect evidence that indicates that current alignment techniques are inadequate for nebulous tasks such as mitigating covert biases, highlighting the need for capable datasets, data curating techniques, or alignment tools.

Paper Structure

This paper contains 33 sections, 6 equations, 48 figures, 10 tables.

Figures (48)

  • Figure 1: Diagram illustrating how the text, traits, and prompt formats are used to calculate association scores. This sample comes from the matched-meaning setting, where the AAE and SAE texts are semantically equivalent. Note that each text sample is formatted and passed through the model individually.
  • Figure 2: Average favorability scores for the top 5 personality traits most associated with AAE/SAE (covert, left) and African-Americans/Caucasians (overt, right). Red dotted lines represent the average favorability scores for African Americans from the Princeton trilogy studies and Bergsieker et al. (Katz, 1933; Gilbert, 1951; Karlins, 1969; Bergsieker et al., 2012). Note that all models have negative favorability for African-Americans in the overt setting.
  • Figure 3: RLHF models' covert trait bias trend lines. The parabolic shape in the covert experiments indicates that very unfavorable and very favorable traits are associated with AAE, while neutral traits are associated with SAE. No RLHF method changes the covert behavior significantly, which indicates that covert biases are difficult to alter; the overt biases, however, appear to be more malleable. Full scatter plots can be seen in Figures \ref{rlhf_covert}-\ref{rlhf_employability_overlaid}.
  • Figure 4: DPO on Llama 3 and DPO on Mistral trait bias trend lines. Note that Mistral and Llama 3 have two distinctly different trend lines, and that RLHF does not significantly change the behavior of either model in the covert setting. As in previous figures, overt biases appear to be more malleable. Full scatter plots can be seen in Figures \ref{dpo_mistral_covert}-\ref{dpo_mistral_employability_overlaid}.
  • Figure 5: Change in biases when post-training with DPO on Llama 3 vs Mistral. Mistral appears to have lower variance in the change in association score across all tasks. This indicates that some models may have biases that are easier to modify than others (means and variances are given in Table \ref{tab:SFT-tab}).
  • ...and 43 more figures
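
The association-score computation described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the score is the average log-ratio of the probability a model assigns to a trait after an AAE text versus after the matched SAE text, which is the usual convention in matched-guise probing but is not stated in this summary. The names `association_score` and `trait_prob`, and the `{text}` prompt placeholder, are hypothetical.

```python
import math
from typing import Callable, Sequence


def association_score(
    trait: str,
    aae_texts: Sequence[str],
    sae_texts: Sequence[str],
    prompts: Sequence[str],
    trait_prob: Callable[[str, str], float],
) -> float:
    """Average log-ratio of trait probability for AAE vs. SAE inputs.

    ASSUMPTION: the log-ratio form is a common matched-guise convention,
    not a formula taken from this page. `trait_prob(prompt, trait)` is a
    hypothetical callable returning the model's probability of `trait`
    as the continuation of `prompt`.
    """
    log_ratios = []
    for prompt in prompts:
        # Each text is formatted and scored individually, as in Figure 1.
        for aae, sae in zip(aae_texts, sae_texts):
            p_aae = trait_prob(prompt.format(text=aae), trait)
            p_sae = trait_prob(prompt.format(text=sae), trait)
            log_ratios.append(math.log(p_aae / p_sae))
    return sum(log_ratios) / len(log_ratios)


def toy_trait_prob(prompt: str, trait: str) -> float:
    """Stand-in for a real model: fixed probabilities keyed on a marker."""
    return 0.2 if "guise-A" in prompt else 0.1


score = association_score(
    trait="example-trait",
    aae_texts=["guise-A sample"],
    sae_texts=["guise-B sample"],
    prompts=["A person who says {text} is"],
    trait_prob=toy_trait_prob,
)
```

Under this assumed definition, a positive score means the trait is more strongly associated with the AAE guise, a negative score means it is more strongly associated with the SAE guise, and zero means no difference.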