Multilingual Substitution-based Word Sense Induction

Denis Kokosinskii, Nikolay Arefyev

TL;DR

This work tackles Word Sense Induction (WSI) across languages by introducing multilingual substitution-based methods built on the multilingual language model XLM-R. It adapts monolingual substitution pipelines (Concat, WCM) and target-injection strategies (SDP, +embs) to a setting that covers 100 languages, enabling unsupervised WSI with minimal language-specific tuning. Evaluations on SE10, SE13, bts-rnc-ru, XL-WSD, and DWUG de Sense show that SDP-based multilingual WSI is competitive with the state of the art on English, while monolingual finetuning of WCM yields robust cross-lingual substitutes that facilitate truly multilingual WSI (e.g., WCM-en SDP-en). The results demonstrate practical potential for high-quality WSI in low-resource languages, reducing the need for language-specific lexical resources and enabling cross-lingual sense labeling and analysis with interpretable substitutes.

Abstract

Word Sense Induction (WSI) is the task of discovering the senses of an ambiguous word by grouping usages of this word into clusters corresponding to these senses. Many approaches have been proposed to solve WSI in English and a few other languages, but these approaches are not easily adaptable to new languages. We present multilingual substitution-based WSI methods that support any of the 100 languages covered by the underlying multilingual language model with minimal to no adaptation required. Despite their multilingual capabilities, our methods perform on par with existing monolingual approaches on popular English WSI datasets. At the same time, they will be most useful for lower-resourced languages, which lack the lexical resources available for English and thus have a higher demand for unsupervised methods like WSI.
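
The core idea of substitution-based WSI, representing each usage by its likely substitutes and then clustering those representations, can be illustrated with a minimal sketch. This is not the authors' pipeline: the substitutes below are hand-written placeholders (a real system would generate them with a masked language model such as XLM-R), and the average-linkage agglomerative clustering over cosine similarity, along with the names `usages`, `cosine`, and `induce_senses`, are assumptions made only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical substitutes for four usages of the word "bank".
# In a real system these would be produced by a masked language model.
usages = {
    "u1": ["lender", "institution", "firm", "company"],
    "u2": ["institution", "lender", "fund", "company"],
    "u3": ["shore", "riverside", "edge", "side"],
    "u4": ["shore", "edge", "slope", "side"],
}


def cosine(a, b):
    """Cosine similarity between two bag-of-substitutes count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def induce_senses(usages, n_senses):
    """Group usages into n_senses clusters by average-linkage merging."""
    vectors = {u: Counter(subs) for u, subs in usages.items()}
    clusters = [[u] for u in usages]  # start with one cluster per usage
    while len(clusters) > n_senses:
        best = None
        # Find the pair of clusters with the highest mean pairwise similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters


clusters = induce_senses(usages, n_senses=2)
print(clusters)
```

With these toy substitutes, the financial usages (`u1`, `u2`) and the riverside usages (`u3`, `u4`) end up in separate clusters, and the shared substitutes of each cluster serve as a human-readable description of the induced sense.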

Paper Structure (24 sections, 7 figures, 6 tables)


Figures (7)

  • Figure 1: Substitution-based approach to Word Sense Induction.
  • Figure 2: Cumulative distribution of the number of XLM-R and BERT tokens per substitute in human-annotated lexical substitution datasets. English substitutes are from CoInCo, German substitutes are from GermEval 2015, and French substitutes are from SemDis 2014. Multi-word substitutes are not taken into account.
  • Figure 3: Evaluation of substitute generators on the lexical substitution datasets. Blue is the Concat substitute generator and orange is the WCM.
  • Figure 4: Distribution of WordNet hyponymy relations for different target injection methods. Only the top-20 substitutes are used for each instance.
  • Figure 5: Evaluation of different configurations of our system on the WSI dev sets.
  • ...and 2 more figures