How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Leonardo Bertolazzi, Sandro Pezzelle, Raffaella Bernardi

TL;DR

This paper advances an interpretability-based account of why content effects arise in LLM reasoning by showing that validity and plausibility are linearly encoded and highly aligned in internal representations. Through steering experiments across multiple models and prompting styles, the authors demonstrate that each of these two abstract concepts can be controlled via a single direction and that the two causally influence each other, which explains the biases observed in model behavior. They further present a training-free debiasing intervention that disentangles validity from plausibility, reducing content effects and improving accuracy. The findings suggest that representational interventions can yield more logical, trustworthy reasoning in LLMs and open avenues for applying similar analyses to other cognitive biases.

Abstract

Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
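The alignment and debiasing ideas in the abstract can be illustrated with a small, self-contained sketch. It assumes that a concept vector is the difference between the mean hidden states of two classes and that debiasing is an orthogonal projection; both are common choices in the steering literature, and neither is confirmed by this summary as the authors' exact procedure. The hidden states are synthetic toy data, and all names (`toy_states`, `difference_of_means`, `project_out`) are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Component-wise mean of a list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos, neg):
    """Concept vector: mean of positive states minus mean of negative states.
    (Assumed construction; the paper's exact recipe is not given here.)"""
    return [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project_out(v, direction):
    """Remove from v its component along `direction` (orthogonal projection)."""
    scale = dot(v, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(v, direction)]


# Synthetic "hidden states": Gaussian noise around a class-specific shift.
rng = random.Random(0)
DIM = 16


def toy_states(shift, n=200):
    return [[s + rng.gauss(0.0, 1.0) for s in shift] for _ in range(n)]


zero = [0.0] * DIM
valid_shift = [1.0] * 8 + [0.0] * 8                          # toy "valid" offset
plaus_shift = [1.0] * 4 + [0.0] * 4 + [1.0] * 4 + [0.0] * 4  # overlaps with it

validity_vec = difference_of_means(toy_states(valid_shift), toy_states(zero))
plausibility_vec = difference_of_means(toy_states(plaus_shift), toy_states(zero))

# Alignment between the two concept directions (entangled by construction).
alignment = cosine_similarity(validity_vec, plausibility_vec)

# A debiased validity direction with the plausibility component removed.
debiased = project_out(validity_vec, plausibility_vec)
residual = cosine_similarity(debiased, plausibility_vec)

print(f"alignment before projection: {alignment:.2f}")
print(f"alignment after projection:  {residual:.2e}")
```

In this toy setup the two shifts share half of their active dimensions, so the vectors start out partially aligned, and the projection drives the residual alignment to numerical zero. Applying such a vector to a real model would additionally require choosing a layer and a scaling factor, which this sketch leaves out.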

Paper Structure

This paper contains 36 sections, 7 equations, 23 figures, 4 tables.

Figures (23)

  • Figure 1: Validity and plausibility configurations. Illustrative examples of valid and invalid syllogisms with plausible and implausible conclusions. Here, plausible indicates that the conclusion is true in the real world, whereas implausible indicates that it is false.
  • Figure 2: Representational analysis of validity and plausibility concepts. (a) Steering power (SP) of validity and plausibility vectors applied at different hidden layers of Qwen2.5-32B-Instruct. The region in yellow highlights layers with $\mathrm{SP} > 0.75$. Validity and plausibility steering vectors show high SP at similar layers using both zero-shot and CoT prompting. (b) 3D PCA projection of hidden states from layer 50 of Qwen2.5-32B-Instruct in the zero-shot setting showing four distinct clusters corresponding to model predictions (valid/invalid, true/false). The parallel geometric structure between true/false and valid/invalid clusters suggests shared representational directions for plausibility and validity. (c) Average cosine similarity between the validity vector and vectors for the concepts of plausibility, hypernymy, and harmlessness across all layers for both models under zero-shot and CoT prompting. High validity-plausibility alignment (0.53 to 0.64) contrasts with low alignment for other concepts (0.10 to 0.13 and -0.12 to -0.17), confirming specific representational entanglement.
  • Figure 3: Cross-task steering. Average steering power (SP) of plausibility steering vectors when applied during the logical validity classification task ("plausibility $\rightarrow$ validity"), and vice versa ("validity $\rightarrow$ plausibility"), for Qwen2.5-32B-Instruct and Qwen3-14B, under both zero-shot and CoT prompting.
  • Figure 4: Mixed-effects regression. Relationship between average plausibility–validity similarity and content effect across model–prompt pairs. Points are colored by prompting style (zero-shot vs. CoT). As similarity increases, content effects generally increase, and zero-shot prompts tend to produce higher content effects than CoT prompts at comparable similarity levels.
  • Figure 5: Comparison of prompts used in the logical validity (bottom) and plausibility (top) classification tasks. The prompts contain an example for illustrative purposes. For models from the Qwen3 family, we additionally included the string "Keep your thinking concise, avoid over-explaining, and reach a solution efficiently." right after the sentence "Think step by step and reason before answering" to induce the model to use lower thinking effort and limit the computational requirements of running inference.
  • ...and 18 more figures