Invisible Saboteurs: Sycophantic LLMs Mislead Novices in Problem-Solving Tasks

Jessica Y. Bo, Majeed Kazemitabaar, Mengqing Deng, Michael Inzlicht, Ashton Anderson

TL;DR

This paper investigates how sycophancy in LLMs affects novices during open-ended problem solving, specifically in ML debugging tasks. It introduces two chatbots along a HighSycophancy vs LowSycophancy spectrum and uses a within-subjects design with 24 undergraduates to examine effects on mental models, workflows, reliance, and perceptions. Results show that high sycophancy reinforces misconceptions and drives over-reliance, harming learning and task performance, while low sycophancy improves confidence calibration and some learning outcomes; however, most users fail to notice the difference. The work highlights pedagogical and safety implications, arguing for cognitive-preserving AI designs and more ecologically valid evaluations of sycophancy in real-world, multi-turn AI-assisted tasks.

Abstract

Sycophancy, the tendency of LLM-based chatbots to express excessive enthusiasm, agreement, flattery, and a lack of disagreement, is emerging as a significant risk in human-AI interactions. However, the extent to which this affects human-LLM collaboration in complex problem-solving tasks is not well quantified, especially among novices who are prone to misconceptions. We created two LLM chatbots, one with high sycophancy and one with low sycophancy, and conducted a within-subjects experiment (n=24) in the context of debugging machine learning models to isolate the effect of LLM sycophancy on users' mental models, their workflows, reliance behaviors, and their perceptions of the chatbots. Our findings show that users of the high sycophancy chatbot were less likely to correct their misconceptions and spent more time over-relying on unhelpful LLM responses. Despite these impaired outcomes, a majority of users were unable to detect the presence of excessive sycophancy.

Paper Structure

This paper contains 26 sections, 4 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overview of the user study procedure and chatbot conditions evaluated.
  • Figure 2: Relative improvement in the F1-score performance on a holdout dataset, in comparison to the Best (100%) and Baseline (0%).
  • Figure 3: The overall confidence-weighted and count-based accuracy changes (left column) and a more granular analysis of how confidence changes based on the pre-chatbot correctness (right column). A positive confidence score always means a correct belief. Significance in the pre-post change for each condition is measured with the Wilcoxon signed-rank test with the Benjamini-Hochberg correction, and the graphical error bars indicate the standard error of the mean. The significance between conditions is indicated in gray and computed with ANCOVA.
  • Figure 4: Full coded workflows for all participants for the HighSycophancy and LowSycophancy conditions. Each square box represents an event that is a UserQuery, ChatbotResponse, or CodeChange. Outcomes of interest, such as UserQueries with misconceptions, ChatbotResponses that are confirmatory, and CodeChanges that improve or worsen performance, are graphically indicated. Events are grouped together in chunks, with each chunk indicating the RelianceOutcome by colour.
  • Figure 5: Proportion of workflows spent in the five reliance outcomes (left) and the proportions of confirmatory misconceived queries (top right), reliance on LLM behaviours (middle right), and confirmatory chatbot responses (bottom right) identified in the workflows. Statistical testing for significance between conditions is performed with the z-test for proportions, and significant values are in bold. While both conditions had similar rates of misconceived queries and reliance on LLM behaviour, HighSycophancy workflows spent significantly more time in Over-Reliance rather than Appropriate Reliance on the LLM, because the HighSycophancy chatbot was far more likely to confirm, and thereby validate, misconceptions.
  • ...and 7 more figures
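
Two of the quantities named in the captions above are easy to state in code. The sketch below is illustrative only and is not the authors' analysis code. It assumes that the relative improvement in Figure 2 is a linear rescaling of the F1-score between Baseline (0%) and Best (100%), and that the z-test for proportions in Figure 5 is the standard pooled, two-sided variant. The function names and all numbers in the usage note are hypothetical.

```python
import math


def relative_improvement(f1, baseline_f1, best_f1):
    """Rescale an F1-score so that the baseline maps to 0% and the best to 100%.

    Assumption: a plain linear rescaling; the paper may define it differently.
    """
    if best_f1 == baseline_f1:
        raise ValueError("best_f1 and baseline_f1 must differ")
    return 100.0 * (f1 - baseline_f1) / (best_f1 - baseline_f1)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-sided z-test for the equality of two proportions.

    Returns the z statistic and the two-sided p-value.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis that both groups are equal.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

As a usage example with made-up values, `relative_improvement(0.70, 0.60, 0.80)` places a participant's model halfway between Baseline and Best, and `two_proportion_z_test(60, 100, 40, 100)` compares two hypothetical proportions of coded events between conditions.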