In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations

Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou

TL;DR

This study asks where a causal transformer detects semantic violations during sentence processing by analyzing layer-wise hidden states in Phi-2 (2.7B). Through layer-wise linear decoding and representational-dimension metrics, the authors find a late, cluster-wide decoding peak around layers $18$–$30$, with a maximal signal near layer $22$, and observe a biphasic expansion–contraction of representational dimensionality across layers. The findings echo psycholinguistic theories that semantic integration occurs after structural analysis, suggesting a convergent processing order between artificial transformers and human reading. The work highlights the potential for brain–model comparisons to illuminate latent computations while acknowledging limitations in cross-model generality and the need for richer multimodal validation.

Abstract

How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (Phi-2) on a carefully curated corpus of sentences that ended either plausibly or implausibly. Our analysis focused on the hidden states sampled at each model layer. To investigate how violations are encoded, we used two complementary probes. First, we ran per-layer detection with a linear probe. A simple linear decoder struggled to distinguish plausible from implausible endings in the lowest third of the model's layers, but its accuracy rose sharply in the middle blocks and peaked just before the top layers. Second, we examined the effective dimensionality of the encoded violation. The violation initially widens the representational subspace, which then collapses after a mid-stack bottleneck. This might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results suggest a possible alignment with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, that is, later in the online processing sequence.

Paper Structure

This paper contains 18 sections, 9 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the analysis pipeline. (A) Stimuli consist of 1520 matched sentence pairs (760 plausible, 760 implausible), each differing only in the final noun while preserving a fixed syntactic template. (B) Each sentence is processed once through all 32 layers of the Phi-2 transformer. For every layer $l$, the hidden-state matrix $H^{(l)}(x) \in \mathbb{R}^{T_x \times d}$ is extracted and reshaped into a vector $h^{(l)}(x) = \operatorname{vec}(H^{(l)}(x))$. Activations are normalized within each layer across sentences prior to feature computation. (C) Two complementary layer-wise metrics are computed. (i) A logistic-regression decoder trained on the mean activation $\mu^{(l)}(x)$ yields a decoding score $\mathrm{AUC}_l$, quantifying how well layer $l$ separates plausible from implausible sentences. (ii) The representational dimensionality of each layer is estimated using the participation ratio $\mathrm{PR}_l$, derived from the eigenvalues of the covariance matrix of $h^{(l)}(x)$ across sentences. Together, these analyses characterize how semantic-violation information becomes linearly decodable and how the effective dimensionality of the model’s internal representations evolves across layers.
  • Figure 2: Layer-wise decoding of semantic anomalies in Phi-2. Mean ROC–AUC (blue; ±1 SEM shaded) for a logistic classifier that distinguishes plausible from violation endings at each encoding layer. The red dashed line shows chance performance (0.5). The grey shading marks the only cluster of consecutive layers (18–30) whose AUC reliably exceeded chance after cluster-based permutation correction ($p<0.001$).
  • Figure 3: Effective dimensionality of hidden states. Top: Participation ratio (PR) for violation (red) and control (green) sentences across the 32 encoding layers of Phi-2. Bottom: Difference trace (violation – control) relative to the zero baseline (grey dashed line). Violations initially occupy a higher-dimensional subspace (layers 1–6), converge with controls around the mid-stack bottleneck (layer 12, grey vertical tick), and become marginally more compressed in deeper layers. These dynamics suggest an early expansion of representational space to accommodate unexpected input, followed by a gradual re-integration as contextual constraints accumulate.
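
The participation ratio mentioned in the Figure 1 and Figure 3 captions can be illustrated with a minimal sketch. It assumes the commonly used definition $\mathrm{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, where $\lambda_i$ are the eigenvalues of the covariance matrix across sentences; the captions do not spell out the exact formula, so this is an assumption rather than the authors' implementation. The function name `participation_ratio` and the toy arrays are hypothetical and stand in for real per-layer features from Phi-2.

```python
import numpy as np


def participation_ratio(features: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum (assumed standard form).

    `features` has shape (n_sentences, n_features): one row per sentence.
    """
    # Center each feature across sentences.
    centered = features - features.mean(axis=0, keepdims=True)
    # Sample covariance matrix of shape (n_features, n_features).
    cov = centered.T @ centered / (features.shape[0] - 1)
    # eigvalsh is appropriate because the covariance matrix is symmetric.
    eigvals = np.linalg.eigvalsh(cov)
    # Clip tiny negative eigenvalues caused by floating-point error.
    eigvals = np.clip(eigvals, 0.0, None)
    return float(eigvals.sum() ** 2 / np.square(eigvals).sum())


# Hypothetical toy data, not the paper's stimuli or activations.
rng = np.random.default_rng(0)
# Variance spread roughly evenly over 10 directions.
isotropic = rng.standard_normal((200, 10))
# Variance confined to a single direction.
one_dim = np.outer(rng.standard_normal(200), np.ones(10))

pr_iso = participation_ratio(isotropic)
pr_one = participation_ratio(one_dim)
print(f"isotropic PR = {pr_iso:.2f}, one-dimensional PR = {pr_one:.2f}")
```

Under this definition the value ranges from 1 (all variance along one direction) up to the number of features (variance spread evenly), which is why a higher PR for violation sentences in early layers reads as a wider representational subspace.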