Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments

Lorenz Linhardt, Tom Neuhäuser, Lenka Tětková, Oliver Eberle

TL;DR

This work investigates how language models align with human term-similarity judgments, using a triplet-based framework (3TT) to assess both layer-wise representations and generated behavior across 32 models, including pretrained and instruction-tuned variants. It finds that small models can achieve near-human representational alignment, that instruction tuning generally boosts this alignment, and that layer-wise patterns are highly model-dependent. Behavioral alignment, in contrast, increases with model size and is not fully predicted by representational alignment; strong $r$ values between human agreement and the gamma metric are reported across models. The study demonstrates the explanatory value of triplet-based evaluations for automatic assessment of semantic representations and highlights the need for large models to achieve robust behavioral alignment, informing approaches to build more trustworthy language systems. Overall, the results suggest a nuanced, size- and tuning-dependent relationship between how concepts are represented and how they manifest in model behavior, with implications for alignment research and model design.

Abstract

Small and mid-sized generative language models have gained increasing attention. Their size and availability make them amenable to being analyzed at a behavioral as well as a representational level, allowing investigations of how these levels interact. We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments on a word triplet task. This provides a novel evaluation setting to probe semantic associations in language beyond common pairwise comparisons. We find that (1) even the representations of small language models can achieve human-level alignment, (2) instruction-tuned model variants can exhibit substantially increased agreement, (3) the pattern of alignment across layers is highly model dependent, and (4) alignment based on models' behavioral responses is highly dependent on model size, matching their representational alignment only for the largest evaluated models.

Paper Structure

This paper contains 16 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We assess the alignment of language model representations at different layers (attention block, MLP, residual stream) with the human similarity space via a triplet task: Human raters judge which of two terms is more similar to an anchor term. The human choice is compared to the model choice, which is based on representational similarities to the anchor. The fraction of agreeing choices is the "choice accuracy".
  • Figure 2: Representational and behavioral choice accuracy vs. model size for instruction-tuned (I.T.) models. OpenELM models are excluded due to poor instruction following.
  • Figure 3: Choice accuracy across layers for two pretrained models. (Left) In OpenELM-1.1B, choice accuracy rises nearly monotonically. (Right) In Gemma2-2B, a bimodal pattern can be observed.
  • Figure 4: Choice accuracy before and after representation centering for pretrained (left) and instruction-tuned (right) models. In most cases, centering leads to higher choice accuracy (below the diagonal).
  • Figure 5: Total variation of choice accuracy over all layers of a particular type. Each box aggregates all pretrained models.
  • ...and 5 more figures
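
The choice accuracy described in the Figure 1 caption, and the centering step mentioned for Figure 4, can be illustrated with a minimal sketch. It is not the authors' implementation. Cosine similarity as the similarity measure, centering as subtraction of the mean vector over all terms, and the names `choice_accuracy`, `toy_embeddings`, and `toy_triplets` are all assumptions made for illustration; the toy vectors and human choices are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(embeddings):
    """Subtract the mean vector over all terms (assumed form of centering)."""
    vectors = list(embeddings.values())
    dim = len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    return {term: [x - m for x, m in zip(vec, mean)]
            for term, vec in embeddings.items()}


def choice_accuracy(embeddings, triplets, centered=False):
    """Fraction of triplets where the model choice matches the human choice.

    Each triplet is (anchor, option_a, option_b, human_choice); the model
    chooses the option whose representation is more similar to the anchor.
    """
    if centered:
        embeddings = center(embeddings)
    agreements = 0
    for anchor, option_a, option_b, human_choice in triplets:
        sim_a = cosine_similarity(embeddings[anchor], embeddings[option_a])
        sim_b = cosine_similarity(embeddings[anchor], embeddings[option_b])
        model_choice = option_a if sim_a >= sim_b else option_b
        agreements += int(model_choice == human_choice)
    return agreements / len(triplets)


# Hypothetical 2-D "representations" and human choices, for illustration only.
toy_embeddings = {
    "cat": [1.0, 0.0],
    "rat": [0.8, 0.6],
    "meow": [0.6, 0.8],
    "dog": [0.0, 1.0],
}
toy_triplets = [
    ("cat", "rat", "meow", "rat"),   # model picks "rat": agrees
    ("dog", "meow", "rat", "meow"),  # model picks "meow": agrees
    ("rat", "cat", "dog", "dog"),    # model picks "cat": disagrees
]

raw_accuracy = choice_accuracy(toy_embeddings, toy_triplets)
centered_accuracy = choice_accuracy(toy_embeddings, toy_triplets, centered=True)
```

On this toy data the model agrees with two of the three human choices without centering. In practice the vectors would come from a specific layer of a language model, and the same function would be evaluated once per layer to obtain curves like those in Figure 3.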