The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks

Dominik Schlechtweg, Sachin Yadav, Nikolay Arefyev

TL;DR

The paper tackles fragmentation in lexical semantic change research by introducing a shared LSCD benchmark that standardizes evaluation across Word-in-Context (WiC), Word Sense Induction (WSI), and Lexical Semantic Change Detection (LSCD) tasks on multilingual, diachronic data. It implements a modular pipeline (retrieve uses, build contextualized embeddings, generate pairs, score pairs with WiC, cluster, and aggregate measures) that supports both end-to-end and component-level evaluation with standard splits and metrics such as $APD$, $COS$, and $JSD$. Across diverse datasets, the authors establish strong baselines with models like XL-DURel and XL-LEXEME, while showing that WiC performance largely drives LSCD outcomes, albeit with notable exceptions. The benchmark thus promotes reproducibility, cross-task transfer, and language-diverse evaluation, providing a foundation for future improvements in diachronic lexical semantics and model generalization.
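
To make two of the aggregate measures concrete, here is a minimal sketch in Python. It follows the expansions given in the Figure 3 caption (APD = Average Pairwise Distance, COS = cosine distance between average embeddings). The function names, the use of cosine distance inside APD, and the toy vectors are assumptions for illustration, not the benchmark's actual code or data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(embeddings):
    """Component-wise average of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def apd(emb1, emb2):
    """Average distance over all cross-period usage pairs (assumed: cosine)."""
    distances = [cosine_distance(u, v) for u in emb1 for v in emb2]
    return sum(distances) / len(distances)


def cos(emb1, emb2):
    """Cosine distance between the average embedding of each period."""
    return cosine_distance(mean_vector(emb1), mean_vector(emb2))


# Hypothetical 2-d usage embeddings of one target word in two time periods.
period1 = [[1.0, 0.0], [1.0, 0.0]]
period2 = [[0.0, 1.0], [1.0, 0.0]]

apd_score = apd(period1, period2)
cos_score = cos(period1, period2)
print(f"APD={apd_score:.3f} COS={cos_score:.3f}")
```

In this toy case half of the cross-period pairs are orthogonal, so APD reacts more strongly than COS, which first averages the usages of each period and then compares only the two means.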

Abstract

Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two successively applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation, results become easily reproducible, and through standardization, different components can be freely combined. The repository reflects the task's modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components, providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state-of-the-art.
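
The three-step operationalization in the abstract (WiC labels for usage pairs, WSI on the resulting graph, comparison of sense clusters over time) can be sketched end to end in Python. Everything below is a toy under stated assumptions: the WiC scores are made up, connected components over thresholded edges stand in for a real WSI clustering method, and the change score is taken to be the Jensen-Shannon distance between the two periods' sense distributions. None of it is the benchmark's own implementation.

```python
import math
from collections import Counter


def cluster_usages(usages, wic_scores, threshold=0.5):
    """Stand-in WSI: connected components over edges with score >= threshold."""
    parent = {u: u for u in usages}

    def find(u):
        # Follow parent links to the component root, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for (u, v), score in wic_scores.items():
        if score >= threshold:
            parent[find(u)] = find(v)
    return {u: find(u) for u in usages}


def sense_distribution(usage_ids, labels, senses):
    """Relative frequency of each sense cluster among the given usages."""
    counts = Counter(labels[u] for u in usage_ids)
    return [counts[s] / len(usage_ids) for s in senses]


def jsd(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


# Hypothetical usages: a1, a2 from period 1; b1, b2 from period 2.
period1, period2 = ["a1", "a2"], ["b1", "b2"]
# Step 1 (WiC): made-up similarity scores for usage pairs (graph edge weights).
wic_scores = {
    ("a1", "a2"): 0.9, ("a1", "b1"): 0.8, ("a2", "b1"): 0.9,
    ("a1", "b2"): 0.1, ("a2", "b2"): 0.2, ("b1", "b2"): 0.1,
}
# Step 2 (WSI): derive sense clusters from the graph.
labels = cluster_usages(period1 + period2, wic_scores)
senses = sorted(set(labels.values()))
# Step 3 (LSCD): compare the sense distributions of the two periods.
graded_change = jsd(
    sense_distribution(period1, labels, senses),
    sense_distribution(period2, labels, senses),
)
print(f"clusters={labels} graded_change={graded_change:.3f}")
```

Here usage b2 ends up in its own cluster that occurs only in the second period, so the two sense distributions differ and the change score is above zero. Because each step is a separate function, any one of them can be swapped out or evaluated on its own, which is the kind of component-level evaluation the abstract describes.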

Paper Structure (24 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Word Usage Graph of English plane (left), subgraphs for first time period $G_1$ (middle) and for second time period $G_2$ (right). Black/gray lines indicate high/low edge weights.
  • Figure 2: Token-based LSCD pipelines and their evaluation.
  • Figure 3: Result overview on Graded Change (left) and COMPARE score (right). XLD = XL-DURel, XLM = XL-LEXEME, MCL/MCLen = DeepMistake checkpoints, TH = thresholded, APD = Average Pairwise Distance, DIA = DiaSense, COS = Cosine distance between average embeddings.
  • Figure 4: Result overview on WiC.
  • Figure 5: Result overview on dataset versions for bi-encoder models.
  • ...and 1 more figure