SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation

Felix Hill, Roi Reichart, Anna Korhonen

TL;DR

SimLex-999 tackles the core challenge that existing semantic-evaluation resources largely conflate similarity with association. It introduces a context-free, graded similarity dataset spanning adjectives, nouns, and verbs with controlled concreteness, enabling fine-grained analysis of how models capture true similarity across concept types. The study shows state-of-the-art distributional models still lag behind human judgments on SimLex-999, highlighting the difficulty of modelling genuine similarity and revealing that dependency-informed input and smaller context windows can improve performance for similarity over association. By providing a diverse, analyzable benchmark with meta-information on POS and concreteness, SimLex-999 guides the development of next-generation distributional semantic representations and grounded, concept-level language understanding. Overall, the work demonstrates substantial room for improvement and offers concrete insights into how to tailor architectures to capture human-like similarity more accurately, with significant implications for lexical resources, translation, and semantic parsing.

Abstract

We present SimLex-999, a gold standard resource for evaluating distributional semantic models that improves on existing resources in several important ways. First, in contrast to gold standards such as WordSim-353 and MEN, it explicitly quantifies similarity rather than association or relatedness, so that pairs of entities that are associated but not actually similar [Freud, psychology] have a low rating. We show that, via this focus on similarity, SimLex-999 incentivizes the development of models with a different, and arguably wider range of applications than those which reflect conceptual association. Second, SimLex-999 contains a range of concrete and abstract adjective, noun and verb pairs, together with an independent rating of concreteness and (free) association strength for each pair. This diversity enables fine-grained analyses of the performance of models on concepts of different types, and consequently greater insight into how architectures can be improved. Further, unlike existing gold standard evaluations, for which automatic approaches have reached or surpassed the inter-annotator agreement ceiling, state-of-the-art models perform well below this ceiling on SimLex-999. There is therefore plenty of scope for SimLex-999 to quantify future improvements to distributional semantic models, guiding the development of the next generation of representation-learning architectures.

Paper Structure

This paper contains 41 sections, 4 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Boxplots showing the interaction between concreteness and POS for concepts in USF. The white boxes range from the first to third quartiles and the central vertical line indicates the median.
  • Figure 2: Instructions for SimLex-999 annotators.
  • Figure 3: A group of noun pairs to be rated by moving the sliders. The rating slider was initially at position 0, and it was possible to attribute a rating of 0, although it was necessary to have actively moved the slider to that position to proceed to the next page.
  • Figure 4: Left: Inter-annotator agreement, measured by average pairwise Spearman $\rho$ correlation, for ratings of concepts of different types in SimLex-999. Right: Response consistency, reflecting the standard deviation of annotator ratings for each pair, averaged over all pairs in the concept category.
  • Figure 5: (a) Pairs rated by WS-353 annotators (blue points, ranked by rating) and the corresponding rating of annotators following the SimLex-999 instructions (red points). (b-c) The same analysis, restricted to pairs in the WS-Sim or WS-Rel subsets of WS-353.
  • ...and 6 more figures
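
The two statistics named in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a hypothetical layout in which each annotator's ratings are a list over the same ordered set of word pairs. The function names, the tie handling by mean rank, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from itertools import combinations
from statistics import mean, pstdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def average_pairwise_spearman(ratings):
    """Mean Spearman rho over all pairs of annotators.

    ratings: one list of scores per annotator, all over the same word pairs.
    """
    return mean(spearman(a, b) for a, b in combinations(ratings, 2))


def response_consistency(ratings):
    """Mean, over word pairs, of the std dev of annotator ratings.

    Assumption: population std dev (pstdev); the sample version is
    an equally plausible reading of the caption.
    """
    return mean(pstdev(per_pair) for per_pair in zip(*ratings))


# Toy data (invented): three annotators rating four word pairs.
toy_ratings = [
    [1.0, 3.0, 5.0, 9.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.5, 2.5, 7.0, 6.0],
]
print(average_pairwise_spearman(toy_ratings))
print(response_consistency(toy_ratings))
```

The same `spearman` helper could be applied to a model's similarity scores and the gold ratings, which is the usual way such a benchmark is scored.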