
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning

Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Wenxiao Wang, Xu Shen, Jieping Ye

TL;DR

This work tackles sycophancy in LLMs, where models defer to the user's view even when that view is incorrect. It introduces supervised pinpoint tuning (SPT), which first uses path patching and knockout validation to identify a small, sparse set of attention heads that most influence sycophantic behavior, and then fine-tunes only those heads while freezing the rest. Across Mistral and Llama-2 models on the SycophancyEval benchmark, SPT substantially reduces apologizing and increases truthfulness, with smaller distribution shifts than full-model supervised fine-tuning and with general abilities preserved. The approach generalizes across datasets and model scales and can be combined with other PEFT techniques such as LoRA, offering a targeted, efficient, and interpretable route to safer, more trustworthy LLMs. The work also discusses the limited granularity of path patching and the need for validation beyond the current benchmarks.

Abstract

Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose supervised fine-tuning (SFT) to mitigate the sycophancy issue, but this typically degrades LLMs' general capability. To address this challenge, we propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules that significantly affect a particular behavior of LLMs, i.e., sycophancy. Subsequently, SPT fine-tunes only these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments, demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs. Code and data are available at https://github.com/yellowtownhz/sycophancy-interpretability.
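
The "tune a few modules, freeze the rest" idea in the abstract can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the names `select_heads` and `pinpoint_sgd_step`, the importance scores, the weights, and the gradients are all hypothetical, and a plain SGD update stands in for whatever optimizer the paper actually uses. It only shows the shape of the procedure: rank heads by an importance score, keep a small fraction, and update only those.

```python
from typing import Dict, List, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def select_heads(scores: Dict[Head, float], fraction: float) -> Set[Head]:
    """Pick the top `fraction` of heads by absolute importance score."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda head: abs(scores[head]), reverse=True)
    return set(ranked[:k])


def pinpoint_sgd_step(
    params: Dict[Head, List[float]],
    grads: Dict[Head, List[float]],
    trainable: Set[Head],
    lr: float,
) -> Dict[Head, List[float]]:
    """Apply one SGD step to the selected heads; copy all others unchanged."""
    updated = {}
    for head, weights in params.items():
        if head in trainable:
            updated[head] = [w - lr * g for w, g in zip(weights, grads[head])]
        else:
            updated[head] = list(weights)  # frozen head
    return updated


# Toy setup (all numbers are made up): 5 layers x 10 heads = 50 heads.
heads = [(layer, h) for layer in range(5) for h in range(10)]
scores = {head: 0.01 for head in heads}
scores[(2, 3)] = 0.9   # hypothetical sycophancy-related head
scores[(4, 1)] = -0.7  # hypothetical sycophancy-related head

selected = select_heads(scores, fraction=0.04)
params = {head: [1.0, 1.0] for head in heads}
grads = {head: [1.0, -1.0] for head in heads}
updated = pinpoint_sgd_step(params, grads, selected, lr=0.5)

print(sorted(selected))
print(updated[(2, 3)], updated[(0, 0)])
```

In a real framework the same effect is usually obtained by disabling gradients for the frozen parameters or by masking gradients per head. Which of these the released code uses is not stated in this summary.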

Paper Structure (53 sections, 2 equations, 7 figures, 20 tables, 2 algorithms)

This paper contains 53 sections, 2 equations, 7 figures, 20 tables, 2 algorithms.

Figures (7)

  • Figure 1: Illustration of the LLM's response before (red) and after (green) applying pinpoint tuning. The first response shows typical yes-man sycophancy: the model admits a mistake when questioned by the user and gives a wrong answer even though it originally answered correctly. The second response behaves like a truth-telling AI assistant. The proposed pinpoint tuning consists of two steps: ➀ "diagnose" which components of the network contribute to sycophancy; ➁ precisely optimize the pinpointed components to improve performance.
  • Figure 2: (a) Diagnosing how strongly each head directly affects the logit associated with sycophancy. For each head, a darker color indicates a larger logit difference from the original model before patching. A small number of heads ($4\%$) have a noteworthy influence on the output. (b) The effect on the LLM's behavior of gradually knocking out the sycophancy-related heads. The frequency of apologizing decreases from $100\%$ to $18\%$, and the accuracy after the challenge increases from $30\%$ to $40\%$. (c) The sycophancy-related heads assign significantly higher attention scores to the tokens of the words that challenge the model, while the sycophancy-agnostic heads do the opposite.
  • Figure 3: The accuracy of Llama-2-13B on SycophancyEval datasets before (left) and after (right) being challenged with "I don't think that's true. Are you sure?". Green and red rectangles represent the proportions of correct and incorrect answers. Veins represent changes from correct to incorrect (diagonal) and incorrect to correct (contra-diagonal). The model switches its answer from correct to incorrect on $1941$ out of $2393$ questions, which is over $80\%$ of the total.
  • Figure 4: An illustration of the path patching method. It measures the importance of forward paths (i.e., the red lines that run from Head $0.31$ to Output) for a two-layer transformer in completing the task on reference data.
  • Figure 5: More results of path patching and knockout experiments on Llama-2 series and Qwen series.
  • ...and 2 more figures
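
The diagnosis and knockout steps described in the captions of Figures 2 and 4 can be illustrated with a deliberately simplified toy. This sketch is an assumption-laden stand-in, not the paper's procedure: the "model" is just a weighted sum of per-head outputs, so only direct head-to-output paths exist. The function names (`logit`, `patch_effects`, `knock_out`) and every number are hypothetical. Patching replaces one head's output on the reference input with its output on a counterfactual input and records the logit change. Knockout overwrites the top-ranked heads with a fixed value.

```python
from typing import Dict, Iterable, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def logit(head_outputs: Dict[Head, float], readout: Dict[Head, float]) -> float:
    """Toy model: the target logit is a weighted sum of per-head outputs."""
    return sum(readout[h] * head_outputs[h] for h in head_outputs)


def patch_effects(
    ref: Dict[Head, float], cf: Dict[Head, float], readout: Dict[Head, float]
) -> Dict[Head, float]:
    """Logit change when a single head's reference output is swapped
    for its counterfactual output, with all other heads kept as-is."""
    base = logit(ref, readout)
    effects = {}
    for head in ref:
        patched = dict(ref)
        patched[head] = cf[head]
        effects[head] = logit(patched, readout) - base
    return effects


def knock_out(
    outputs: Dict[Head, float], heads: Iterable[Head], value: float = 0.0
) -> Dict[Head, float]:
    """Overwrite the outputs of `heads` with a fixed value (zero here)."""
    knocked = dict(outputs)
    for head in heads:
        knocked[head] = value
    return knocked


# Made-up head outputs on a reference and a counterfactual prompt.
ref = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 4.0}
cf = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 1.0}
readout = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0}

effects = patch_effects(ref, cf, readout)
ranked = sorted(effects, key=lambda h: abs(effects[h]), reverse=True)
after = logit(knock_out(ref, ranked[:1]), readout)

print(ranked[0], effects[ranked[0]])
print(logit(ref, readout), after)
```

In a real transformer, later components also read from each head, so path patching has to hold those indirect paths fixed while varying only the path of interest. The toy sidesteps this by construction.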