To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models

Bastien Liétard, Pascal Denis, Mikaela Keller

TL;DR

This work introduces Concept Induction (CI), an unsupervised task that learns a soft clustering of a lexicon into latent concepts, thereby unifying polysemy and synonymy within a single framework. A bi-level approach combines local lemma-centric clustering with global cross-lexicon clustering, leveraging contextualized language model embeddings to produce concept clusters that generalize WordNet-like synsets. Experiments on SemCor show that CI achieves BCubed $F_1$ scores above $0.60$ and improves Word Sense Induction when data are scarce, while enabling competitive concept-aware embeddings for Word-in-Context tasks with much less data than prior methods. The results demonstrate the value of a concept-centered perspective for lexical semantics and point to practical applications in resource-scarce languages and downstream semantic tasks, along with ethical considerations related to biases encoded in language models.

Abstract

Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena are widely documented in lexical resources and have been studied extensively in NLP, leading to dedicated systems, they are often considered independently in practical problems. Whereas many tasks dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight the role of word senses, the study of synonymy is rooted in the study of concepts, i.e. meanings shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon view to induce concepts. We evaluate the obtained clustering on SemCor's annotated data and obtain good performance (BCubed F1 above 0.60). We find that the local and the global levels are mutually beneficial for inducing concepts, and also senses, in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining performance competitive with the state of the art.
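
The abstract reports a BCubed F1 score. As a rough illustration of what that metric measures, here is a minimal sketch of the standard BCubed precision, recall and F1 for a hard clustering, where each item has exactly one predicted cluster and one gold class. The function name `bcubed` and the toy labels are hypothetical, and the paper's exact evaluation protocol (which items are scored, how soft assignments are handled) is not described in this excerpt, so this should not be read as the authors' implementation.

```python
from collections import Counter


def bcubed(pred, gold):
    """Return (precision, recall, F1) for two hard labelings of the same items.

    pred[i] is the predicted cluster of item i, gold[i] its gold class.
    This is the standard hard-clustering BCubed definition (an assumption
    about the setting, not taken from the paper).
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and of equal length")

    pred_sizes = Counter(pred)        # size of each predicted cluster
    gold_sizes = Counter(gold)        # size of each gold class
    joint = Counter(zip(pred, gold))  # items sharing both cluster and class
    n = len(pred)

    # Per-item precision: share of the item's cluster that has its gold class.
    precision = sum(joint[(p, g)] / pred_sizes[p] for p, g in zip(pred, gold)) / n
    # Per-item recall: share of the item's gold class found in its cluster.
    recall = sum(joint[(p, g)] / gold_sizes[g] for p, g in zip(pred, gold)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: four occurrences, two predicted clusters, two gold classes.
p, r, f = bcubed([0, 0, 1, 1], ["a", "a", "a", "b"])
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because every item counts itself, both averages are strictly positive, so the harmonic mean is always defined.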

Paper Structure (41 sections, 3 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Illustration of our framework. The word "trial" is polysemous and has two senses corresponding to two different concepts; it is synonymous with "test" in the second sense.
  • Figure 2: Distribution of cluster size (in number of lemmas) obtained by the Bi-level Agglo system.