Detection of Non-recorded Word Senses in English and Swedish

Jonathan Lautenschlager, Emma Sköldberg, Simon Hengchen, Dominik Schlechtweg

TL;DR

This paper tackles Unknown Sense Detection to aid dictionary maintenance by comparing word usages in modern and historical English and Swedish corpora against dictionary senses using Word-in-Context embeddings. It proposes a two-pronged approach: constructing target-usage and sense embeddings with XL-LEXEME, and evaluating similarity thresholds under a constrained, few-shot setting with human annotations. Through two annotation phases and extensive cross-validation, the authors demonstrate that automatic methods can substantially increase the detection of non-recorded senses, with Swedish predictions particularly strong, enabling practical updates to WordNet and the Swedish SO. The study also identifies challenges in headword identification, multiword expressions, and preprocessing, outlining concrete future work to improve robustness and applicability in lexicography.
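The thresholding idea in this summary can be illustrated with a minimal sketch. It assumes that usage and sense embeddings are already available as plain vectors (in the paper they come from XL-LEXEME; here they are toy arrays), and that a usage is flagged when its best cosine similarity to any dictionary sense falls below a threshold. The function names, the max-over-senses decision rule, and the threshold value 0.5 are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_unrecorded(usage_vec, sense_vecs, threshold):
    """Flag a usage as non-recorded if no sense embedding is similar enough.

    Assumption: the decision uses the maximum similarity over all senses.
    """
    best = max(cosine_similarity(usage_vec, s) for s in sense_vecs)
    return best < threshold

# Toy stand-ins for embeddings of two dictionary senses (hypothetical values).
sense_vecs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
known_usage = np.array([0.9, 0.1, 0.0])   # close to the first sense
novel_usage = np.array([0.0, 0.0, 1.0])   # orthogonal to both senses

flag_known = is_unrecorded(known_usage, sense_vecs, threshold=0.5)
flag_novel = is_unrecorded(novel_usage, sense_vecs, threshold=0.5)
```

In this toy setup only the usage that is orthogonal to both sense vectors is flagged as a candidate non-recorded sense.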

Abstract

This study addresses the task of Unknown Sense Detection in English and Swedish. The primary objective of this task is to determine whether the meaning of a particular word usage is documented in a dictionary or not. For this purpose, sense entries are compared with word usages from modern and historical corpora using a pre-trained Word-in-Context embedder that allows us to model this task in a few-shot scenario. Additionally, we use human annotations on the target corpora to adapt hyperparameters and evaluate our models using 5-fold cross-validation. Compared to a random sample from a corpus, our model is able to considerably increase the detected number of word usages with non-recorded senses.
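A rough sketch of how a similarity threshold might be tuned and evaluated with 5-fold cross-validation follows. The abstract does not spell out the procedure, so several things here are assumptions: each annotated usage is reduced to its maximum similarity to any recorded sense, the threshold is chosen from a small candidate grid on the training folds, and F1 is used as the selection criterion. All names and data are hypothetical.

```python
import numpy as np

def f1_score(pred, gold):
    """F1 for boolean arrays, where True means 'non-recorded sense'."""
    tp = int(np.sum(pred & gold))
    fp = int(np.sum(pred & ~gold))
    fn = int(np.sum(~pred & gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cross_validate_threshold(max_sims, gold, candidates, k=5, seed=0):
    """Pick the best threshold on k-1 folds, then score it on the held-out fold."""
    max_sims = np.asarray(max_sims, dtype=float)
    gold = np.asarray(gold, dtype=bool)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(max_sims)), k)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Assumed selection criterion: F1 on the training folds.
        best_t = max(
            candidates,
            key=lambda t: f1_score(max_sims[train_idx] < t, gold[train_idx]),
        )
        test_f1 = f1_score(max_sims[test_idx] < best_t, gold[test_idx])
        results.append((best_t, test_f1))
    return results

# Synthetic data: 20 usages, annotated as non-recorded when similarity is low.
max_sims = np.linspace(0.05, 0.95, 20)
gold = max_sims < 0.4
candidates = [0.2, 0.4, 0.6]
results = cross_validate_threshold(max_sims, gold, candidates)
```

Each entry of `results` pairs the threshold selected on the training folds with its F1 on the corresponding held-out fold.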

Paper Structure

This paper contains 35 sections, 1 equation, 1 figure, and 12 tables.

Figures (1)

  • Figure 1: Precision and recall for all five folds in cross-validation round 10 of model E4_COS on English data