The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks
Dominik Schlechtweg, Sachin Yadav, Nikolay Arefyev
TL;DR
The paper tackles fragmentation in lexical semantic change research by introducing a shared benchmark that standardizes evaluation across three tasks: Word-in-Context (WiC), Word Sense Induction (WSI), and Lexical Semantic Change Detection (LSCD), applied to multilingual, diachronic data. It implements a modular pipeline (retrieving uses, building contextualized embeddings, generating usage pairs, WiC scoring, clustering, and aggregating measures) that supports both end-to-end and component-level evaluation, with standard splits and change measures such as APD, COS, and JSD. Across diverse datasets, the authors establish strong baselines with models like XL-DURel and XL-LEXEME, and find that WiC performance largely drives LSCD outcomes, albeit with notable exceptions. The benchmark thus promotes reproducibility, cross-task transfer, and language-diverse evaluation, providing a foundation for future improvements in diachronic lexical semantics and model generalization.
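To make two of the aggregate measures named above concrete, the following is a minimal sketch, not the benchmark's own code. It assumes the common definitions from the LSCD literature: APD as the average pairwise cosine distance between usage embeddings from two time periods, and COS as the cosine distance between the two period-level mean embeddings. The function names and the toy two-dimensional vectors are illustrative assumptions.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def apd(emb_t1, emb_t2):
    """Average pairwise cosine distance across the two periods (assumed definition)."""
    pairs = list(product(emb_t1, emb_t2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def cos(emb_t1, emb_t2):
    """Cosine distance between the period-level mean embeddings (assumed definition)."""
    return cosine_distance(mean_vector(emb_t1), mean_vector(emb_t2))


# Toy usage embeddings (hypothetical): a word whose usages stay put
# versus a word whose usages move to an orthogonal direction.
period_1 = [[1.0, 0.0], [1.0, 0.0]]
period_2_stable = [[1.0, 0.0], [1.0, 0.0]]
period_2_changed = [[0.0, 1.0], [0.0, 1.0]]

stable_scores = (apd(period_1, period_2_stable), cos(period_1, period_2_stable))
changed_scores = (apd(period_1, period_2_changed), cos(period_1, period_2_changed))
```

In this toy setup both measures are close to 0 for the stable word and close to 1 for the changed word. With real contextualized embeddings the two measures can diverge, since APD is sensitive to within-period spread while COS compares only the averages.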
Abstract
Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two successively applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to considerable heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options, and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations, or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation, results become easily reproducible, and through standardization, different components can be freely combined. The repository reflects the task's modularity by allowing model evaluation for WiC, WSI, and LSCD. This allows for careful evaluation of increasingly complex model components, providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state of the art.
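The three-step operationalization described in the abstract (WiC scores, then a usage graph with WSI clustering, then a comparison of sense clusters over time) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repository's implementation: the WiC scores are hand-written rather than produced by a model, connected components over thresholded edges stand in for a real WSI clustering algorithm, and the change score is taken to be the Jensen-Shannon distance (square root of the base-2 Jensen-Shannon divergence) between the two periods' sense frequency distributions. All function names, the threshold value, and the usage identifiers are hypothetical.

```python
import math
from collections import Counter


def cluster_usages(usage_ids, wic_scores, threshold=0.5):
    """Stand-in WSI step: connected components over edges whose WiC score
    reaches the threshold, implemented with a small union-find."""
    parent = {u: u for u in usage_ids}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for (u, v), score in wic_scores.items():
        if score >= threshold:
            parent[find(u)] = find(v)
    return {u: find(u) for u in usage_ids}


def sense_distribution(usage_ids, clusters, senses):
    """Relative frequency of each sense cluster among the given usages."""
    counts = Counter(clusters[u] for u in usage_ids)
    return [counts[s] / len(usage_ids) for s in senses]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


def graded_change(usages_t1, usages_t2, wic_scores, threshold=0.5):
    """Cluster all usages jointly, then compare sense distributions over time."""
    clusters = cluster_usages(usages_t1 + usages_t2, wic_scores, threshold)
    senses = sorted(set(clusters.values()))
    p = sense_distribution(usages_t1, clusters, senses)
    q = sense_distribution(usages_t2, clusters, senses)
    return js_distance(p, q)


# Hypothetical usage identifiers for two time periods.
old_usages = ["old_1", "old_2"]
new_usages = ["new_1", "new_2"]

# Hand-written WiC scores in [0, 1]: high means "same meaning".
scores_changed = {
    ("old_1", "old_2"): 0.9,
    ("new_1", "new_2"): 0.9,
    ("old_1", "new_1"): 0.1,
    ("old_2", "new_2"): 0.2,
}
scores_stable = {pair: 0.9 for pair in scores_changed}

change_high = graded_change(old_usages, new_usages, scores_changed)
change_low = graded_change(old_usages, new_usages, scores_stable)
```

Because each step consumes only the previous step's output, any of the three functions can be swapped for a stronger component without touching the others, which is the kind of free recombination the abstract describes.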
