Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

Andor Diera, Ansgar Scherp

Abstract

Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.

Paper Structure (27 sections, 6 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Probing results on dense post-residual representations in the three models. Blue circles mark the center of mass (COM) of layer accuracies; orange squares mark the peak depth. Shaded bars show the 95% confidence interval. Numbers at right indicate Peak–COM difference.
  • Figure 2: Block-wise accuracy differences relative to the post-residual stream. Points show mean $\Delta$ accuracy (Attention/MLP probes compared to the post-residual probe); shaded bands represent the 95% confidence intervals.
  • Figure 3: Average cosine similarity for intra-group (synonym–synonym, antonym–antonym) and inter-group (synonym–antonym) word pairs across embedding, middle, and final layers in three models.
  • Figure 4: Mean change of target logit on the training set as a function of $k$. Yellow line: mean $|\Delta\,\text{logit}|$ over relations; shaded band: interquartile range across layers. Vertical dotted line: reference $k = 327$ (1% of SAE width). Horizontal dashed line: 90% cutoff. Blue marker: smallest $k$ meeting the cutoff.
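
The summary statistics named in the captions of Figures 1 and 4 can be illustrated with a minimal sketch. The captions do not give exact definitions, so the code below rests on two assumptions: the center of mass (COM) is taken to be the accuracy-weighted mean layer index, and the 90% cutoff is taken to be 90% of the largest mean |Δ logit| observed over the tested values of $k$. All function names are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def center_of_mass(layer_acc: Sequence[float]) -> float:
    """Accuracy-weighted mean layer index (ASSUMED definition of COM)."""
    total = sum(layer_acc)
    if total == 0:
        raise ValueError("layer accuracies sum to zero; COM is undefined")
    return sum(i * a for i, a in enumerate(layer_acc)) / total


def peak_depth(layer_acc: Sequence[float]) -> int:
    """Index of the layer with the highest probe accuracy (first one on ties)."""
    return max(range(len(layer_acc)), key=lambda i: layer_acc[i])


def smallest_k_meeting_cutoff(
    ks: Sequence[int],
    effects: Sequence[float],
    cutoff: float = 0.9,
) -> Optional[int]:
    """Smallest k whose mean |delta logit| reaches cutoff * max(effects).

    ASSUMPTION: the cutoff is relative to the largest effect seen over
    the tested k values; the caption does not specify the reference.
    """
    threshold = cutoff * max(effects)
    for k, effect in sorted(zip(ks, effects)):
        if effect >= threshold:
            return k
    return None


if __name__ == "__main__":
    # Synthetic per-layer accuracies, for illustration only.
    acc = [0.6, 0.9, 0.7]
    print("COM:", center_of_mass(acc))
    print("Peak:", peak_depth(acc))
    print("Peak - COM:", peak_depth(acc) - center_of_mass(acc))

    # Synthetic effect curve over k, for illustration only.
    print("k*:", smallest_k_meeting_cutoff([1, 10, 100, 1000], [0.2, 0.8, 0.95, 1.0]))
```

Under these assumptions, a Peak–COM difference near zero would indicate that accuracy is spread roughly symmetrically around the peak layer, while a large difference would indicate a skewed layer profile.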