
The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Harshee Jignesh Shah

Abstract

Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy, a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamically detects user persuasion tactics and adjusts AI behavior to maintain factual integrity. Our architecture introduces three components: (1) a Behavioral Access Control (BAC) system that restricts context-layer access based on real-time sycophancy risk scores, (2) a Trait Classifier that identifies persuasion tactics across multi-turn dialogues, and (3) a Generator-Critic loop in which an auditor vetoes sycophantic drafts and triggers rewrites with "Necessary Friction." In a live evaluation across all 437 TruthfulQA adversarial scenarios, Claude Sonnet 4 exhibits a 9.6% baseline sycophancy rate, which the Silicon Mirror reduces to 1.4%, an 85.7% relative reduction (p < 10^-6, OR = 7.64, Fisher's exact test). Cross-model evaluation on Gemini 2.5 Flash shows a 46.0% baseline reduced to 14.2% (p < 10^-10, OR = 5.15). We characterize the validation-before-correction pattern as a distinct failure mode of RLHF-trained models.
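As a quick sanity check on the headline statistics, the odds ratio and Fisher's exact p-value can be recomputed from the implied contingency table. The counts below (42/437 and 6/437) are inferred from the reported percentages and are an assumption, not figures stated in the abstract:

```python
from math import comb

# Counts inferred from the reported rates (an assumption; the abstract
# gives only percentages): 9.6% of 437 ≈ 42 baseline failures vs.
# 1.4% of 437 ≈ 6 failures under the Silicon Mirror.
n = 437
a, b = 42, n - 42   # baseline: sycophantic / non-sycophantic
c, d = 6, n - 6     # Silicon Mirror: sycophantic / non-sycophantic

odds_ratio = (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value, summing the hypergeometric
    probabilities of all tables at least as extreme as the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    def p_table(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)
    p_obs = p_table(a)
    return sum(p_table(x)
               for x in range(max(0, col1 - row2), min(col1, row1) + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

p = fisher_exact_two_sided(a, b, c, d)
print(round(odds_ratio, 2))  # 7.64
print(p < 1e-6)              # True
```

The recomputed odds ratio (42 × 431 / (395 × 6) ≈ 7.64) matches the reported OR, which supports the inferred counts.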

Paper Structure

This paper contains 28 sections, 1 equation, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: The Silicon Mirror pipeline. The Trait Classifier produces a trait vector $\mathbf{t}$; BAC computes risk $R$ and selects an adapter; the Critic audits drafts and can veto up to $k=2$ times, triggering rewrites with friction instructions.
  • Figure 2: BAC layer restriction policy. As sycophancy risk increases, interpretive layers (GRAPH, ENTITY) are locked, forcing the generator to rely on raw facts and curated knowledge.
  • Figure 3: Sycophancy rates across all three conditions from live evaluation ($n=437$ per condition) with 95% Clopper-Pearson confidence intervals.
  • Figure 4: Failure matrix showing which scenarios produced sycophantic responses under each condition. Only TQA-008 (UK clothing laws) defeated both vanilla and static guardrails. The Silicon Mirror's sole failure is an ambiguous indexical question.
  • Figure 5: The Validation-before-Correction (VbC) pattern observed in vanilla Claude compared to the Silicon Mirror's correction-first approach. VbC is likely an artifact of RLHF training, where responses opening with agreement receive higher human preference ratings.
  • ...and 3 more figures
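The Generator-Critic loop described in the Figure 1 caption can be sketched as a small control loop. All names, the friction wording, and the toy generator/critic below are illustrative assumptions, not the paper's implementation; only the control flow (audit each draft, veto and rewrite with added friction up to k=2 times) follows the caption:

```python
def run_pipeline(user_turns, generate, critic, k=2):
    """Generator-Critic loop (sketch): the Critic audits each draft and may
    veto up to k times; each veto triggers a rewrite with an added
    'Necessary Friction' instruction."""
    friction = ""
    draft = generate(user_turns, friction)
    for _ in range(k):
        if not critic(draft):  # draft passes the audit
            return draft
        friction += " [Necessary Friction: correct the false premise first.]"
        draft = generate(user_turns, friction)
    return draft  # after k vetoes, the final rewrite is returned regardless

# Toy stand-ins to demonstrate the control flow only.
def toy_generate(turns, friction):
    return "Actually, that is a myth." if friction else "Great point, you're right!"

def toy_critic(text):
    return text.startswith("Great point")  # True = sycophantic, veto

print(run_pipeline(["Cracking knuckles causes arthritis, right?"],
                   toy_generate, toy_critic))
# -> Actually, that is a myth.
```

The cap of k vetoes bounds latency: after the final rewrite, the response is emitted even if the Critic would still flag it.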