A Unified Representation Underlying the Judgment of Large Language Models
Yi-Long Lu, Jiajun Song, Wei Wang
TL;DR
The paper asks whether LLM judgments rely on modular, specialized systems or on a domain-general mechanism. It identifies a dominant Valence-Assent Axis (VAA) that unifies subjective value and factual assent; the axis is revealed through PCA of hidden states and validated by cross-domain causal steering across multiple instruction-tuned models. The VAA subordinates reasoning: it biases the chain-of-thought toward stance-consistent justifications and, under pressure, produces coherent but hallucinatory reasoning. This gives a unified account of response bias and hallucination. The work challenges modular views of judgment in LLMs and suggests representational editing as a potential path to decoupling valuation from knowledge, informing the design of more truthful and reliable AI systems.
Abstract
A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture for evaluative judgment. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence ("what is good") and the model's assent to factual claims ("what is true"). Through direct interventions, we demonstrate that this axis drives a critical mechanism, which we identify as the subordination of reasoning: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. Our discovery offers a mechanistic account of response bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
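
As a rough illustration of the two steps named above (finding a dominant axis by PCA over hidden states, then intervening along it), the following Python sketch uses synthetic vectors in place of real model activations. The array shapes, the planted direction `true_axis`, the helper names `dominant_axis` and `steer`, and the steering strength `alpha` are all assumptions made for illustration; this is not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: n statements x d hidden dimensions.
# A single planted direction carries the evaluative signal (assumption).
n, d = 200, 16
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)
labels = rng.choice([-1.0, 1.0], size=n)  # -1 = "bad/false", +1 = "good/true"
hidden = 3.0 * labels[:, None] * true_axis + 0.5 * rng.normal(size=(n, d))


def dominant_axis(states: np.ndarray) -> np.ndarray:
    """Return the first principal component (unit vector) of the states."""
    centered = states - states.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def steer(state: np.ndarray, axis: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state by alpha along the axis (a steering intervention)."""
    return state + alpha * axis


axis = dominant_axis(hidden)

# The sign of a principal component is arbitrary, so orient it such that
# positively labeled items project higher than negatively labeled ones.
proj = hidden @ axis
if proj[labels > 0].mean() < proj[labels < 0].mean():
    axis = -axis

steered = steer(hidden[0], axis, alpha=2.0)
```

Because `axis` has unit length, steering with `alpha` moves a state's projection onto the axis by exactly `alpha` and leaves the orthogonal components unchanged. In a real model the shift would be applied to activations at a chosen layer during generation, which this sketch does not attempt.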
