Cat, Rat, Meow: On the Alignment of Language Model and Human Term-Similarity Judgments
Lorenz Linhardt, Tom Neuhäuser, Lenka Tětková, Oliver Eberle
TL;DR
This work investigates how well language models align with human term-similarity judgments. Using a triplet-based framework (3TT), it assesses both layer-wise representations and generated behavior across 32 models, including pretrained and instruction-tuned variants. It finds that small models can reach near-human representational alignment, that instruction tuning generally increases this alignment, and that layer-wise patterns are highly model-dependent. Behavioral alignment, in contrast, increases with model size and is not fully predicted by representational alignment. The study also reports strong $r$ values between human agreement and the gamma metric $ccc$ across models. It demonstrates the explanatory value of triplet-based evaluations for the automatic assessment of semantic representations and highlights that large models are needed for robust behavioral alignment, which can inform approaches to building more trustworthy language systems. Overall, the results suggest a nuanced, size- and tuning-dependent relationship between how concepts are represented and how they manifest in model behavior, with implications for alignment research and model design.
Abstract
Small and mid-sized generative language models have gained increasing attention. Their size and availability make them amenable to being analyzed at a behavioral as well as a representational level, allowing investigations of how these levels interact. We evaluate 32 publicly available language models for their representational and behavioral alignment with human similarity judgments on a word triplet task. This provides a novel evaluation setting to probe semantic associations in language beyond common pairwise comparisons. We find that (1) even the representations of small language models can achieve human-level alignment, (2) instruction-tuned model variants can exhibit substantially increased agreement, (3) the pattern of alignment across layers is highly model dependent, and (4) alignment based on models' behavioral responses is highly dependent on model size, matching their representational alignment only for the largest evaluated models.
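To make the notion of representational alignment on a word triplet task more concrete, the following is a minimal sketch and not the authors' implementation. It assumes an odd-one-out formulation: for each triplet, the model's choice is the word left out of the most cosine-similar pair of embeddings, and alignment is the fraction of triplets where this matches the human choice. The function names `odd_one_out` and `triplet_agreement` and the toy vectors are hypothetical, and the paper's exact scoring may differ.

```python
import numpy as np


def odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index (0, 1, or 2) of the item outside the most similar pair.

    `embeddings` has shape (3, d): one vector per word in the triplet.
    Assumption: similarity is cosine similarity between the vectors.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda p: sim[p[0], p[1]])
    # Indices sum to 0 + 1 + 2 = 3, so the remaining index is 3 - i - j.
    return 3 - i - j


def triplet_agreement(triplet_embeddings, human_choices) -> float:
    """Fraction of triplets where the model's odd-one-out matches the human one."""
    model_choices = np.array([odd_one_out(e) for e in triplet_embeddings])
    return float(np.mean(model_choices == np.array(human_choices)))


# Toy, hand-made 2-d vectors standing in for (cat, rat, meow); not real model output.
toy_triplet = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
toy_human_choice = 2  # hypothetical human judgment: "meow" is the odd one out
score = triplet_agreement([toy_triplet], [toy_human_choice])
```

In practice the embeddings would be taken from a chosen layer of a language model, so repeating the computation per layer would give a layer-wise alignment profile of the kind the abstract describes.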
