A Unified Representation Underlying the Judgment of Large Language Models

Yi-Long Lu, Jiajun Song, Wei Wang

TL;DR

The paper tackles whether LLM judgments rely on modular, specialized systems or a domain-general mechanism. It identifies a dominant Valence–Assent Axis (VAA) that unifies subjective value and factual assent, revealed through PCA of hidden states and validated by cross-domain causal steering across multiple instruction-tuned models. The VAA subordinates reasoning, biasing the chain-of-thought toward stance-consistent justifications and producing coherent hallucinatory reasoning when pressured, thereby explaining response bias and hallucinations in a unified framework. This work challenges modular views of judgment in LLMs and suggests representational editing as a potential path to decouple valuation from knowledge, informing the design of more truthful and reliable AI systems.

Abstract

A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture for evaluative judgment. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we demonstrate that this axis drives a critical mechanism, the subordination of reasoning: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. Our discovery offers a mechanistic account of response bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.

Paper Structure

This paper contains 38 sections, 3 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Competing architectures for evaluative judgment in Large Language Models. a, A domain-specific architecture posits that different judgments, like assessing subjective value ("what is good") and factual claims ("what is true"), are handled by separate computational pathways. A key prediction of this model is that causally intervening on one pathway (e.g., for value) would have minimal influence on the other (e.g., for truth). b, A domain-general architecture, in contrast, proposes that diverse judgments converge on a shared functional core. Therefore, an intervention on this axis is predicted to have cross-domain effects, systematically influencing outputs in the factual domain even when targeting the value domain. Our experiments strongly support this second model, identifying this shared core as the Valence–Assent Axis (VAA). c, A key implication of the domain-general architecture is the subordination of reasoning. For a single factual prompt, direct manipulation of the VAA compels the model to generate three distinct but internally coherent arguments, selectively framing evidence to justify a VAA-aligned stance.
  • Figure 2: A single Judgment Axis spans multiple domains. The figure details the identification, characterization, and validation of the Judgment Axis in the Qwen2.5-14B-Instruct model. a, Emergence of a stable judgment representation across layers. The Value Judgment task requires the model to express its stance toward statements on various topics (e.g., "Abortion should be a legal option."), presented in either binary (dashed blue) or continuous (solid blue) formats. In both formats, the correlation between the model's layer-wise activations projected onto the first principal component (PC1) and its final decision increases with layer depth. Meanwhile, the similarity between the PC1 vectors from the two formats (black line) peaks at Layer 28 (star), indicating a stable, format-independent representation of value judgment (Judgment Axis) at this depth. b-c, PCA results at Layer 28. b, Characterization of the judgment space at Layer 28. PC1 robustly separates statements based on the model's stance (logit of support, color-coded), while PC2 appears to capture additional variation related to judgment strength. Each dot represents a single statement. c, Dominance and specificity of the Judgment Axis (PC1). A scree plot shows that PC1 explains the most variance (26.3%). The inset confirms that PC1 is strongly correlated with the model's final decision, justifying its definition as the Judgment Axis. d, Cross-domain control. Steering interventions along the Judgment Axis at Layer 28 systematically modulate decisions in a separate Sentiment Analysis task, where the model classifies news headlines as positive or negative. e, Within-domain control. The same intervention modulates outputs in a continuous rating task (scale from 0 to 9) within the value judgment domain, confirming the axis's functional role. In (d) and (e), the x-axis represents the normalized intervention coefficient $\alpha$. Shaded bands indicate 95% confidence intervals.
  • Figure 3: The Judgment Axis unifies subjective valence with objective truth. a, Causal control over subjective valence. In a Subjective Preference task, intervention on the Judgment Axis at Layer 28 modulates choice between valenced words (e.g., "correct" vs. "incorrect") but not neutral words (e.g., "apple" vs. "banana"). The y-axis shows the output logit difference between the two words. Shaded bands indicate 95% confidence intervals. b, Alignment with a canonical Valence Axis. Activations of the LLM during judgments of individual words' valence were projected onto both the Valence Axis (PC1 of the task) and the Judgment Axis. The scatter plot reveals a strong correlation between these projections, with each point representing a single word (e.g., "Gain", "Loss"). c, Alignment with an Objective Truth Axis. In a Single-Letter Order verification task (e.g., "a" comes before "b"), projections from true vs. false statements align closely with the Judgment Axis, yielding near-perfect correlation with an independently derived Objective Truth Axis. Each dot represents a single statement.
  • Figure 4: The VAA subordinates reasoning in procedural and factual domains. Across all panels, the x-axis represents Alignment Pressure, where positive values indicate that the VAA intervention aligns with ground truth and negative values indicate conflict. a–c, Procedural reasoning in the Alphabetical Order task. (a) Increasingly negative Alignment Pressure systematically reduces answer correctness. (b) As Alignment Pressure becomes negative, Sound Reasoning is degraded into Coherent Hallucinations—instances where reasoning remains logically consistent but the conclusion is factually incorrect, as shown by the distribution of reasoning types. (c) A qualitative example of a Coherent Hallucination, where incorrect reasoning is generated to support a VAA-enforced answer. d–f, Factual reasoning in the Factual Judgment task. (d) The same Alignment Pressure effect is observed on answer correctness. (e) Negative Alignment Pressure again converts Sound Reasoning into Coherent Hallucinations, Incoherent Hallucinations, and Contradictory Reasoning. (f) Qualitative example showing how VAA pressure directs evidence selection. Under conflicting pressure (top), the model focuses on the differences between toads and frogs to justify a "No". Under aligning pressure (bottom), it switches to the taxonomic relationship to justify a "Yes". Shaded bands indicate 95% confidence intervals.
  • Figure 5: The VAA as an engine for stance polarization. a, Qualitative examples of goal-directed argumentation. Varying the Intervention Coefficient compels the Qwen2.5-14B-Instruct model to selectively marshal evidence to support opposing stances on a debatable topic. b, The Intervention Coefficient ($\alpha$) precisely controls both reasoning and the final answer. A strong linear relationship exists between the Intervention Coefficient and the expressed stance of both the reasoning process (reasoning stance, dashed line) and the final answer (answer stance, solid line). Shaded bands indicate 95% confidence intervals. c, Increasing stance extremity gradually reduces reasoning quality. A stacked bar chart shows that as the model is pushed to more extreme stances (either positive or negative), the proportion of Sound Reasoning (green) gradually decreases, replaced by Ambiguous Logic, Coherent Hallucination, and other reasoning types.
  • ...and 6 more figures
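
The captions above describe two operations: taking the first principal component (PC1) of a layer's hidden states as the Judgment Axis, and steering by shifting activations along that axis with a coefficient $\alpha$. The following is a minimal NumPy sketch of that idea on synthetic activations, not the authors' implementation. The function names (`judgment_axis`, `steer`, `alignment_pressure`), the `scale` normalization, and the sign convention for Alignment Pressure are all illustrative assumptions; the captions do not specify how $\alpha$ is normalized or how Alignment Pressure is computed.

```python
import numpy as np


def judgment_axis(hidden, decisions):
    """Return (unit PC1, explained-variance ratio) for (n_samples, d) activations.

    PC1 has an arbitrary sign, so it is oriented (an assumed convention)
    so that projections correlate positively with the model's decisions.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    if np.corrcoef(centered @ axis, decisions)[0, 1] < 0:
        axis = -axis
    ratio = s[0] ** 2 / np.sum(s ** 2)
    return axis, ratio


def steer(hidden, axis, alpha, scale=1.0):
    """Shift every activation by alpha * scale along the unit axis.

    `scale` is a hypothetical normalizer (e.g., the standard deviation of
    projections onto the axis); the paper's normalization is not given here.
    """
    return hidden + alpha * scale * axis


def alignment_pressure(alpha, truth_is_yes):
    """Assumed sign convention: positive when steering agrees with ground truth."""
    return alpha if truth_is_yes else -alpha


# Synthetic demo: one latent "stance" direction plus isotropic noise.
rng = np.random.default_rng(0)
n, d = 200, 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
stance = rng.normal(size=n)
hidden = 3.0 * np.outer(stance, direction) + 0.5 * rng.normal(size=(n, d))

axis, ratio = judgment_axis(hidden, stance)
scale = np.std((hidden - hidden.mean(axis=0)) @ axis)
steered = steer(hidden, axis, alpha=1.0, scale=scale)
print(f"PC1 explains {ratio:.1%} of variance in the synthetic data")
```

In a real model, `hidden` would hold activations collected at one layer (the captions use Layer 28 of Qwen2.5-14B-Instruct), and the shift would be applied inside the forward pass rather than to a stored array.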