To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
Bastien Liétard, Pascal Denis, Mikaela Keller
TL;DR
This work introduces Concept Induction (CI), an unsupervised task that learns a soft clustering of a lexicon into latent concepts, thereby unifying polysemy and synonymy within a single framework. A bi-level approach combines local lemma-centric clustering with global cross-lexicon clustering, leveraging contextualized language model embeddings to produce concept clusters that generalize WordNet-like synsets. Experiments on SemCor show that CI reaches BCubed $F_1$ scores above $0.60$ and improves Word Sense Induction when data are scarce, while enabling competitive concept-aware embeddings for the Word-in-Context task with much less data than prior methods. The results demonstrate the value of a concept-centered perspective for lexical semantics and point to practical applications in resource-scarce languages and downstream semantic tasks, along with ethical considerations related to biases encoded in language models.
Abstract
Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena are widely documented in lexical resources and have been studied extensively in NLP, leading to dedicated systems, they are often considered independently in practical problems. While many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of a word's senses, the study of synonymy is rooted in the study of concepts, i.e. meanings shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon view to induce concepts. We evaluate the obtained clustering on SemCor's annotated data and obtain good performance (BCubed $F_1$ above 0.60). We find that the local and the global levels are mutually beneficial for inducing concepts and also senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.
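To make the reported metric concrete, the following is a minimal sketch of the standard BCubed precision, recall and $F_1$ for a hard assignment of word occurrences to clusters. It is an illustration under assumptions, not the evaluation code used in this work: the function name `bcubed_scores`, the toy labels, and the choice to take $F_1$ as the harmonic mean of the averaged precision and recall are ours, and the paper's exact variant may differ.

```python
from collections import Counter


def bcubed_scores(predicted, gold):
    """Return (precision, recall, f1) under the standard BCubed definition.

    `predicted` and `gold` are equal-length sequences holding, for each
    occurrence, its induced cluster label and its reference concept label.
    Hypothetical helper for illustration only.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    n = len(gold)
    if n == 0:
        raise ValueError("at least one occurrence is required")

    pred_sizes = Counter(predicted)        # size of each induced cluster
    gold_sizes = Counter(gold)             # size of each reference concept
    joint = Counter(zip(predicted, gold))  # size of each (cluster, concept) overlap

    # Per-occurrence precision: share of its induced cluster with the same gold label.
    precision = sum(joint[(p, g)] / pred_sizes[p] for p, g in zip(predicted, gold)) / n
    # Per-occurrence recall: share of its gold concept placed in the same induced cluster.
    recall = sum(joint[(p, g)] / gold_sizes[g] for p, g in zip(predicted, gold)) / n

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with made-up labels: five occurrences, two induced clusters,
# two reference concepts.
predicted = [0, 0, 0, 1, 1]
gold = ["shore", "shore", "institution", "institution", "institution"]
precision, recall, f1 = bcubed_scores(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the scores are averaged over occurrences rather than over clusters, frequent concepts weigh more heavily than rare ones in this formulation.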
