How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
Leonardo Bertolazzi, Sandro Pezzelle, Raffaella Bernardi
TL;DR
This paper offers an interpretability-based account of why content effects arise in LLM reasoning by showing that validity and plausibility are linearly encoded and highly aligned in internal representations. Through steering experiments across multiple models and prompting styles, the authors demonstrate that each of these two abstract concepts can be controlled via a single direction and that the two causally influence each other, which explains the biases observed in model behavior. They further present a training-free debiasing intervention that disentangles validity from plausibility, reducing content effects and improving accuracy. The findings suggest that representational interventions can yield more logical, trustworthy reasoning in LLMs and open avenues for applying similar analyses to other cognitive biases.
Abstract
Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
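To make the geometric claims concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes that concept directions are estimated as a difference of class means, that alignment is measured by cosine similarity, and that a debiased direction is obtained by projecting out the plausibility component; the abstract does not specify any of these choices. All names (`diff_of_means_direction`, `steer`, `remove_component`) and the synthetic activations are hypothetical.

```python
import numpy as np


def diff_of_means_direction(acts, labels):
    """Unit vector pointing from the mean of label-False rows to the mean of label-True rows."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    d = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return d / np.linalg.norm(d)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(h, direction, alpha):
    """Add a scaled concept direction to a hidden state (assumed form of steering)."""
    return h + alpha * direction


def remove_component(v, u):
    """Return the part of v that is orthogonal to u."""
    u = u / np.linalg.norm(u)
    return v - np.dot(v, u) * u


# Synthetic "activations": validity and plausibility share most of one direction,
# which mimics the conflation described in the abstract (an assumption for illustration).
rng = np.random.default_rng(0)
dim, n = 64, 400
valid = rng.integers(0, 2, n).astype(float)
plausible = rng.integers(0, 2, n).astype(float)
shared = rng.normal(size=dim)
valid_only = rng.normal(size=dim)
acts = (
    rng.normal(scale=0.5, size=(n, dim))
    + np.outer(valid, shared + 0.3 * valid_only)
    + np.outer(plausible, shared)
)

v_valid = diff_of_means_direction(acts, valid > 0)
v_plaus = diff_of_means_direction(acts, plausible > 0)

# Alignment between the two concept directions.
alignment = cosine(v_valid, v_plaus)

# Steering along plausibility also moves the state along the validity direction.
h = acts[0]
h_steered = steer(h, v_plaus, alpha=4.0)
validity_shift = float(np.dot(h_steered - h, v_valid))

# One possible debiased validity direction: strip the plausibility component.
v_valid_debiased = remove_component(v_valid, v_plaus)
residual_alignment = cosine(v_valid_debiased, v_plaus)

print(f"alignment={alignment:.3f}, validity_shift={validity_shift:.3f}, "
      f"residual_alignment={residual_alignment:.2e}")
```

In this toy setting, a high `alignment` means that pushing a hidden state along the plausibility direction also shifts its projection onto the validity direction, which is the kind of cross-concept influence the steering experiments probe. With real models, the activations would come from a chosen layer and token position, and the intervention strength `alpha` would need tuning.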
