Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs

Daniel Vennemeyer, Phan Anh Duong, Tiffany Zhan, Tianyu Jiang

Abstract

Large language models (LLMs) often exhibit sycophantic behaviors -- such as excessive agreement with or flattery of the user -- but it is unclear whether these behaviors arise from a single mechanism or multiple distinct processes. We decompose sycophancy into sycophantic agreement and sycophantic praise, contrasting both with genuine agreement. Using difference-in-means directions, activation additions, and subspace geometry across multiple models and datasets, we show that: (1) the three behaviors are encoded along distinct linear directions in latent space; (2) each behavior can be independently amplified or suppressed without affecting the others; and (3) their representational structure is consistent across model families and scales. These results suggest that sycophantic behaviors correspond to distinct, independently steerable representations.
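As a rough illustration of the two core tools named above, the sketch below computes a difference-in-means (DiffMean) direction from two sets of activations and then applies activation addition with a scale `alpha`. It is a toy under stated assumptions, not the authors' implementation: plain Python lists stand in for hidden states, the function names are hypothetical, and normalizing the direction to unit length is an assumption the abstract does not confirm.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diffmean_direction(pos_acts, neg_acts):
    """Difference between the two class means, scaled to unit norm.

    Unit normalization is an assumption made for this sketch.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    return [d / norm for d in diff]


def add_activation(hidden, direction, alpha):
    """Activation addition: shift a hidden state by alpha along a direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-d "activations": examples showing a behavior vs. examples that do not.
behavior_acts = [[2.0, 0.0], [4.0, 0.0]]
other_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = diffmean_direction(behavior_acts, other_acts)

# A positive alpha pushes toward the behavior, a negative alpha away from it.
amplified = add_activation([1.0, 1.0], direction, alpha=4.0)
suppressed = add_activation([1.0, 1.0], direction, alpha=-4.0)
```

In a real model the same arithmetic would be applied to residual-stream activations at a chosen layer, typically through a forward hook; which layer and which scale to use are empirical choices not specified in this excerpt.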

Paper Structure

This paper contains 52 sections, 11 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Layerwise AUROC of DiffMean directions distinguishing sycophantic agreement (SyA), genuine agreement (GA), and sycophantic praise (SyPr) in Qwen3-30B-Instruct on the simple math dataset, with random-label baseline and 95% CI.
  • Figure 2: Cosine similarity of maximum variance angles across datasets showing how SyA and GA diverge across depth, while SyPr remains largely orthogonal.
  • Figure 3: Steering results on Qwen3-30B-Instruct using activation addition of DiffMean directions. Each panel shows steering along one behavior direction: SyA (left), GA (middle), and SyPr (right). Curves track output rates of all three behaviors (blue = SyA, orange = GA, green = SyPr) as the steering vector is scaled relative to baseline. Baseline rates reflect our dataset construction: because we balanced examples where the user’s claim is true vs. false and applied a strict knowledge filter (Section \ref{sec:datasets}), the unsteered model trivially answers correctly, with genuine agreement near 50% and sycophantic agreement near 0%. Accordingly, we steer SyA and SyPr in the positive direction to increase their rates, while GA is steered in the negative direction since it is already at its maximum (agreeing with all instances of correct user claims in the dataset). In all cases, the targeted behavior shifts strongly while the others remain nearly unchanged, demonstrating that the behaviors are causally separable. For example, in the left and right panels, dark red denotes the GA rate under SyA and SyPr steering, respectively, at $\alpha=4$; in the middle panel, dark red denotes the GA rate under GA steering at $\alpha=-4$. 95% CI shown.
  • Figure 4: Steering of SyA, GA, and SyPr across models via activation addition. The setup and results are consistent with Figure \ref{fig:steering_cross_effects}. Each behavior can be modulated independently with minimal cross-effects. 95% CI shown.
  • Figure 5: Layerwise AUROC for detecting SyA, GA, and SyPr after projecting out behavior-specific directions in Qwen3-30B. For example, $W_{\textsc{SyA}} \perp W_{\textsc{SyA}}$ denotes detecting SyA after removing its own subspace, while $W_{\textsc{SyA}} \perp W_{\textsc{GA}}$ denotes detecting SyA after removing the GA subspace. In early layers, removing GA reduces SyA detection (and vice versa), consistent with a shared generic agreement signal before the behaviors diverge. In later layers, discriminability collapses only when a behavior’s own subspace is removed, while the others remain intact.
  • ...and 8 more figures
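
The projection analysis described in the Figure 5 caption can be illustrated with a small sketch: remove the component of each activation along a behavior direction, then check whether a linear readout still separates the two classes by AUROC. This is a simplification under stated assumptions, not the paper's procedure: it removes a single unit-norm direction rather than a subspace $W$, uses a fixed hypothetical `probe` direction instead of a refitted detector, and runs on made-up 2-d data.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(x, unit_dir):
    """Remove the component of x that lies along a unit-norm direction."""
    coeff = dot(x, unit_dir)
    return [a - coeff * d for a, d in zip(x, unit_dir)]


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy data: the two classes differ only along the first coordinate.
behavior_dir = [1.0, 0.0]   # hypothetical unit-norm behavior direction
probe = [0.6, 0.8]          # hypothetical readout that overlaps with it
pos_acts = [[2.0, 0.5], [3.0, -0.5]]
neg_acts = [[0.0, 0.5], [1.0, -0.5]]

auroc_before = auroc([dot(x, probe) for x in pos_acts],
                     [dot(x, probe) for x in neg_acts])

pos_removed = [project_out(x, behavior_dir) for x in pos_acts]
neg_removed = [project_out(x, behavior_dir) for x in neg_acts]

auroc_after = auroc([dot(x, probe) for x in pos_removed],
                    [dot(x, probe) for x in neg_removed])
```

In this toy setting the readout separates the classes perfectly before projection and falls to chance afterwards, which mirrors the pattern the caption reports when a behavior's own subspace is removed. Removing an unrelated direction would leave the scores, and hence the AUROC, unchanged.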