Linear Probe Penalties Reduce LLM Sycophancy

Henry Papadatos, Rachel Freedman

TL;DR

This work tackles the problem of LLM sycophancy, which RLHF can inadvertently amplify. It introduces a linear-probe-based method to identify internal sycophancy signals within the reward model and combines these with the original reward to form a surrogate objective $\hat{R} = R - \lambda S$, which is optimized via Best-of-N sampling. Empirical results on open-source LLMs show that optimizing against the surrogate reward reduces sycophantic behavior across multiple datasets and prompts, suggesting a generalizable approach to curb unwanted LLM behaviors not sufficiently addressed by RLHF alone. The study emphasizes the practicality of training small, targeted probes to detect specific undesired traits and integrating them into reward-based fine-tuning to improve reliability and objectivity.
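
As a minimal sketch of the surrogate objective described above, the following Python shows Best-of-N selection under the surrogate reward (R minus lambda times S). The function names (`surrogate_reward`, `best_of_n`), the toy candidates, and all numeric scores are assumptions made for illustration; in the paper's setting, R would come from a reward model and S from a sycophancy probe, neither of which is reproduced here.

```python
from typing import Callable, Sequence


def surrogate_reward(reward: float, sycophancy: float, lam: float) -> float:
    """Surrogate reward: base reward minus lam times the sycophancy score."""
    return reward - lam * sycophancy


def best_of_n(
    candidates: Sequence[str],
    reward_fn: Callable[[str], float],
    sycophancy_fn: Callable[[str], float],
    lam: float,
) -> str:
    """Return the candidate with the highest surrogate reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        candidates,
        key=lambda c: surrogate_reward(reward_fn(c), sycophancy_fn(c), lam),
    )


# Hypothetical hand-written scores standing in for a reward model (R)
# and a sycophancy probe (S); they are not taken from the paper.
toy_rewards = {"flattering": 2.0, "balanced": 1.5, "blunt": 0.5}
toy_sycophancy = {"flattering": 3.0, "balanced": 0.5, "blunt": -1.0}
candidates = list(toy_rewards)

# lam = 0 recovers plain Best-of-N on the base reward.
pick_base = best_of_n(candidates, toy_rewards.get, toy_sycophancy.get, lam=0.0)
# lam > 0 penalizes candidates that the probe scores as sycophantic.
pick_surrogate = best_of_n(candidates, toy_rewards.get, toy_sycophancy.get, lam=0.5)
```

With these toy numbers, the base reward selects the "flattering" candidate, while the surrogate with lambda = 0.5 selects "balanced", because the penalty outweighs the small reward advantage of the flattering answer.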

Abstract

Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements. This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF), an LLM fine-tuning stage intended to align model outputs with human values. Instead of increasing accuracy and reliability, the reward model learned from RLHF often rewards sycophancy. We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior. Our experiments show that constructing and optimizing against this surrogate reward function reduces sycophantic behavior in multiple open-source LLMs. Our results suggest a generalizable methodology for reducing unwanted LLM behaviors that are not sufficiently disincentivized by RLHF fine-tuning.
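
To make the idea of a linear probe concrete, here is a small Python sketch of one simple way to obtain a probe direction and score a response. It is an illustration under stated assumptions, not the authors' training procedure: the difference-of-means fit is a stand-in technique, the function names are hypothetical, and the two-dimensional "activations" are made-up numbers rather than reward-model hidden states. Averaging per-token scores into one value mirrors the mean sycophancy score reported in the Figure 2 caption.

```python
from statistics import fmean
from typing import List, Sequence

Vector = Sequence[float]


def fit_mean_difference_probe(
    sycophantic: List[Vector], neutral: List[Vector]
) -> List[float]:
    """Stand-in probe fit: mean sycophantic activation minus mean neutral one."""
    dims = range(len(sycophantic[0]))
    return [
        fmean(v[d] for v in sycophantic) - fmean(v[d] for v in neutral)
        for d in dims
    ]


def probe_score(weights: Vector, activation: Vector) -> float:
    """Linear probe output for one token: dot product of weights and activation."""
    return sum(w * a for w, a in zip(weights, activation))


def mean_token_score(weights: Vector, token_activations: List[Vector]) -> float:
    """Average the per-token probe scores into one score for the response."""
    return fmean(probe_score(weights, a) for a in token_activations)


# Made-up 2-D activations; real activations would be high-dimensional.
sycophantic_examples = [[1.0, 0.0], [3.0, 0.0]]
neutral_examples = [[-1.0, 0.0], [-1.0, 2.0]]
weights = fit_mean_difference_probe(sycophantic_examples, neutral_examples)

# Two hypothetical token activations from a single response.
response_tokens = [[1.0, 1.0], [-1.0, 1.0]]
response_score = mean_token_score(weights, response_tokens)
```

The response-level score produced this way is the kind of quantity that could play the role of S in the surrogate reward; how the paper actually trains and applies its probe is not specified in this summary.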

Paper Structure

This paper contains 27 sections, 2 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Feedback sycophancy evaluation procedure. The like feedback positivity measures how often the model gives more positive feedback to poems that the user likes. Conversely, the dislike feedback positivity measures how often the model gives more positive feedback to poems that the user dislikes.
  • Figure 2: Visualization of the sycophancy score (numbers in parentheses) of a non-sycophantic answer to the question: "Is it better to stick to your true values or adapt them to reduce conflict with others?". The mean sycophancy score is -4.9. The probe correctly identifies the token with the least sycophantic meaning.
  • Figure 3: Evolution of the positivity gap under BoN optimization for increasing values of N. These experiments are performed on 300 poems, and the confidence bands correspond to the 95% confidence interval. The answers are optimized against the base UltraRM reward model and its surrogate reward. We observe that the surrogate reward reduces sycophancy, whereas the base reward increases it.
  • Figure 4: Disagreement rate of GPT-4 on various feedback comparisons, done on 100 poems.
  • Figure 5: Disagreement rate of GPT-4 Turbo on various feedback comparisons, done on 100 poems.
  • ...and 8 more figures