
Intelligence Without Integrity: Why Capable LLMs May Undermine Reliability

Ryan Allen, Aticus Peterson

Abstract

As LLMs become embedded in research workflows and organizational decision processes, their effect on analytical reliability remains uncertain. We distinguish two dimensions of analytical reliability -- intelligence (the capacity to reach correct conclusions) and integrity (the stability of conclusions when analytically irrelevant cues about desired outcomes are introduced) -- and ask whether frontier LLMs possess both. Whether these dimensions trade off is theoretically ambiguous: the sophistication enabling accurate analysis may also enable responsiveness to non-evidential cues, or alternatively, greater capability may confer protection through better calibration and discernment. Using synthetically generated data with embedded ground truth, we evaluate fourteen models on a task simulating empirical analysis of hospital merger effects. We find that intelligence and integrity trade off: frontier models most likely to reach correct conclusions under neutral conditions are often most susceptible to shifting conclusions under motivated framing. We extend work on sycophancy by introducing goal-conditioned analytical sycophancy: sensitivity of inference to cues about desired outcomes, even when no belief is asserted and evidence is held constant. Unlike simple prompt sensitivity, models shift conclusions away from objective evidence in response to analytically irrelevant framing. This finding has important implications for empirical research and organizations. Selecting tools based on capability benchmarks may inadvertently select against the stability needed for reliable and replicable analysis.
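To make the two dimensions concrete, the following is a minimal sketch of how they could be scored from repeated model runs. It is not the authors' metric: the function names, the tolerance-based notion of a "correct" conclusion, and the normalization by `scale` are all assumptions introduced here for illustration, and the numbers in the usage note are made up rather than taken from the paper.

```python
from statistics import mean


def intelligence_score(neutral_estimates, ground_truth, tolerance):
    """Share of neutral-framing runs whose estimate is within `tolerance`
    of the ground truth (hypothetical definition of a correct conclusion)."""
    hits = [abs(e - ground_truth) <= tolerance for e in neutral_estimates]
    return sum(hits) / len(hits)


def integrity_score(neutral_estimates, framed_estimates, scale):
    """One minus the absolute shift in the mean estimate between neutral and
    motivated framing, divided by `scale` and clipped at zero (hypothetical
    definition of stability; the evidence is assumed identical across framings)."""
    shift = abs(mean(framed_estimates) - mean(neutral_estimates))
    return max(0.0, 1.0 - shift / scale)
```

For example, with illustrative estimates in percentage points, `intelligence_score([2.0, 2.4, 3.5], ground_truth=2.2, tolerance=0.5)` counts two of three runs as correct, while `integrity_score([2.0, 2.4], [4.0, 4.4], scale=4.0)` penalizes a two-point shift in the mean estimate. Under this sketch a model can score well on the first measure and poorly on the second, which is the pattern the abstract describes.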

Paper Structure (40 sections, 3 equations, 9 figures, 1 table)


Figures (9)

  • Figure 1: Mean predicted merger effect magnitude by model and prompt framing (95% CI). Reference lines indicate the simulated end-of-period ground truth (4.17%) and correct benchmark estimates (two-way fixed effects ATE = 1.8%; Sun and Abraham ATT = 2.2%).
  • Figure 2: Intelligence versus integrity (hospital dataset). Each point is a model; higher values indicate better performance on both dimensions.
  • Figure 3: Trends in intelligence and integrity over model release dates (hospital dataset). Higher is better on both metrics; dashed lines indicate fitted linear trends.
  • Figure 4: Composite rubric score (0--9) by model and prompt framing (hospital dataset; 95% CI). Higher values indicate better combined process and outcome performance.
  • Figure 5: Methodological features by model and prompt framing (hospital dataset; 95% CI). Bars report the share of responses that mention fixed effects, clustered standard errors, lag/ramp-up dynamics, and department-specific heterogeneity.
  • ...and 4 more figures