Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer

Abstract

In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.

Paper Structure (22 sections, 3 equations, 5 figures)

Figures (5)

  • Figure 1: We construct contested-evidence interactions by pairing climate claims with systematically varied in-context evidence and adversarial user pressure. Evidence from the NCA is incrementally revealed (claim, evidence base, research gaps, and confidence rationale) and crossed with neutral and adversarial user contexts (direct belief, skeptical challenge, authority appeal), producing 16 controlled conditions.
  • Figure 2: Exact-match accuracy (%) across all 16 conditions (C1–C16), crossing four evidence configurations (Claim, +Evidence base, +Research gaps, +Confidence characterization) with four interaction settings (Neutral, Direct Belief, Skeptical, Authority).
  • Figure 3: Non-monotonic robustness under full context. Accuracy with full NCA context (claim + evidence description + research gaps + confidence description) across model sizes (log scale) for Gemma-3 and Qwen-2.5 under Neutral, Direct, Skeptical, and Authority prompts.
  • Figure 4: RPS (lower is better) over the four ordered confidence labels across evidence tiers (Base, +Desc, +Gaps, +Full) for representative models under Neutral, Direct Belief, Skeptical, and Authority prompts; dashed line shows the parametric baseline.
  • Figure 5: Probabilistic signatures of conflict across model families. We plot raw ordinal variance (y-axis) across evidence tiers for nine representative models.
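
The captions above refer to a 4 x 4 condition grid, a ranked probability score (RPS) over four ordered confidence labels, and an ordinal variance. The following Python sketch illustrates how these quantities could be computed; it is not the authors' implementation. Several details are assumptions: the label names in `LABELS`, the mapping from condition IDs (C1, C2, ...) to (evidence, context) pairs, the unnormalized form of RPS (some definitions divide by K - 1), and the use of integer positions 0..3 as label scores for the variance.

```python
from itertools import product

# Evidence tiers and user contexts as named in the Figure 2 caption.
EVIDENCE_TIERS = ["Claim", "+Evidence base", "+Research gaps",
                  "+Confidence characterization"]
USER_CONTEXTS = ["Neutral", "Direct Belief", "Skeptical", "Authority"]

# Assumed label names; the captions only say "four ordered confidence labels".
LABELS = ["low", "medium", "high", "very high"]


def build_conditions():
    """Cross evidence tiers with user contexts to get 16 conditions.

    The numbering order (which pair is C1, C2, ...) is hypothetical.
    """
    pairs = product(EVIDENCE_TIERS, USER_CONTEXTS)
    return {f"C{i}": pair for i, pair in enumerate(pairs, start=1)}


def rps(probs, true_index):
    """Unnormalized ranked probability score; lower is better.

    Sums squared differences between the predicted and observed
    cumulative distributions over the first K - 1 labels (the K-th
    term is always zero because both cumulatives equal 1 there).
    """
    cum_pred = 0.0
    score = 0.0
    for k, p in enumerate(probs[:-1]):
        cum_pred += p
        cum_obs = 1.0 if k >= true_index else 0.0
        score += (cum_pred - cum_obs) ** 2
    return score


def ordinal_variance(probs):
    """Variance of the label position (0..K-1) under the distribution."""
    mean = sum(k * p for k, p in enumerate(probs))
    return sum(p * (k - mean) ** 2 for k, p in enumerate(probs))


if __name__ == "__main__":
    conditions = build_conditions()
    print(len(conditions), conditions["C1"])
    peaked = [0.0, 0.0, 1.0, 0.0]       # all mass on "high"
    dispersed = [0.25, 0.25, 0.25, 0.25]
    print(rps(peaked, LABELS.index("high")), ordinal_variance(peaked))
    print(rps(dispersed, LABELS.index("high")), ordinal_variance(dispersed))
```

Under these assumptions, a sharply peaked distribution on the correct label gives an RPS of 0 and a variance of 0, while a uniform distribution gives a higher RPS and a variance of 1.25. This is the kind of contrast between concentrated and dispersed ordinal distributions that the Figure 4 and Figure 5 captions describe.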