
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju

TL;DR

This work systematically assesses how RLHF, using general-purpose human preferences, affects LLM trustworthiness across five axes: toxicity, bias, machine ethics, truthfulness, and privacy. It compares reward-based PPO and reward-free DPO on multiple models and finds that RLHF rarely improves trustworthiness overall, with notable deterioration in bias and truthfulness and increased privacy leakage, while machine ethics improves. A key contribution is adapting efficient influence-function-based data attribution (DataInf) to RLHF, enabling post-hoc identification of fine-tuning data that most strongly influence trustworthiness outcomes and suggesting paths for dataset pruning. The study highlights a misalignment between generic preference data and trustworthiness criteria, urging more nuanced data curation and alignment frameworks to achieve safer, more reliable language models with fewer unintended trade-offs.

Abstract

The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness has not been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences does not automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose adapting efficient influence-function-based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks, and we show the feasibility of this approach by providing our estimated attribution scores. Together, our results underscore the need for more nuanced approaches to model alignment from both the data and framework perspectives, and we hope this research will guide the community towards developing language models that are increasingly capable without sacrificing trustworthiness.
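To make the idea of influence-function-based data attribution concrete, the following is a minimal sketch of the classical damped influence score. It is not the DataInf approximation and not the authors' implementation. It assumes per-sample loss gradients are already available as NumPy arrays, and it uses the averaged outer product of those gradients plus a damping term as a stand-in for the Hessian. The function name, the argument names, and the toy gradients are all hypothetical.

```python
import numpy as np


def influence_scores(train_grads, target_grad, damping=0.1):
    """Estimate each training sample's influence on a target loss.

    train_grads: (n, d) array, one loss gradient per fine-tuning sample.
    target_grad: (d,) array, gradient of the benchmark (target) loss.
    Returns an (n,) array. A negative score means that upweighting the
    sample is estimated to lower the target loss.
    """
    n, d = train_grads.shape
    # Assumption: a damped outer-product matrix stands in for the Hessian.
    hessian = train_grads.T @ train_grads / n + damping * np.eye(d)
    # Solve H x = target_grad instead of forming the inverse explicitly.
    h_inv_target = np.linalg.solve(hessian, target_grad)
    # Classical influence: -(grad_target)^T H^{-1} (grad_train_i).
    return -train_grads @ h_inv_target


# Toy gradients (hypothetical): aligned, orthogonal, and opposed to the target.
train_grads = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
target_grad = np.array([1.0, 0.0])
scores = influence_scores(train_grads, target_grad)
```

In this toy setup, the sample whose gradient points the same way as the target gradient gets a negative score, the orthogonal sample gets a score of zero, and the opposed sample gets a positive score of equal magnitude. Ranking samples by such scores is one way to shortlist candidates for the dataset pruning that the TL;DR mentions.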

Paper Structure (29 sections, 13 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: An illustration of our RLHF framework. SFT requires the prompt and the chosen response, while PPO (with reward modeling) and DPO use pairwise comparison data.
  • Figure 2: Left: Changes in toxicity are small and vary across models. Right: Bias is significantly increased after RLHF, and most of the changes can be attributed to SFT.
  • Figure 3: Left: RLHF improves model performance on identifying ethically wrong actions. Right: The truthfulness of LLMs slightly decreases after RLHF.
  • Figure 4: Left: RLHF increases privacy leakage, and most of the effect comes from PPO and DPO. Right: A high-level summary of the impact of an RLHF step on a trustworthiness aspect. ✓ and ✗ mean a clearly positive or clearly negative effect, respectively, while ? indicates that the net effect is unclear (i.e., within error bounds).
  • Figure 5: Overall contribution scores (red) of RLHF steps on target models across five trustworthiness aspects. Trends vary by aspect and model. Higher scores indicate greater average contribution of data samples to changes in trustworthiness.
  • ...and 7 more figures