Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance
Marko Pranjić, Kaja Dobrovoljc, Senja Pollak, Matej Martinc
TL;DR
This work addresses semantic change detection in Slovene. It introduces the first gold-standard Slovene dataset for the task, with 3,150 sentence pairs covering 104 target words across two time periods, and proposes a distance based on regularized optimal transport (OT) to quantify diachronic shifts. A systematic comparison of static, contextual, clustering-based, and OT-based measures shows that entropically regularized OT often agrees best with human annotations, especially when using layer $L_{11}$ of SloBERTa. The study also shows that a simplified formulation of the average pairwise distance (APD) reduces computational complexity and makes the measure's limitations explicit, and that regularized OT unifies insights from APD and the Wasserstein distance (WD). The dataset and code are released to support future research on semantic change in a less-resourced language.
Abstract
In this paper, we focus on the detection of semantic change in Slovene, a less-resourced Slavic language with two million speakers. Detecting and tracking semantic change provides insight into the evolution of language caused by changes in society and culture. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We analyze an important class of semantic change measures based on the average pairwise distance (APD) and identify several of their limitations. To address these limitations, we propose a novel metric based on regularized optimal transport, which offers a more robust framework for quantifying semantic change. We provide a comprehensive evaluation of existing semantic change detection methods and their associated semantic change measures on our dataset. Through empirical testing, we demonstrate that our proposed approach, which leverages regularized optimal transport, matches or improves on the performance of baseline approaches.
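To make the two families of measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each period is represented by a matrix of contextual embeddings of the target word (one row per usage), that distances are cosine distances, and that all usages carry uniform weight. The function names, the regularization strength `reg`, and the iteration count are hypothetical choices made for illustration. APD averages the distance over all cross-period pairs, while the entropically regularized OT cost is computed here with plain Sinkhorn iterations.

```python
import numpy as np


def cosine_distance_matrix(A, B):
    """Pairwise cosine distances between rows of A (n x d) and rows of B (m x d)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T


def apd(A, B):
    """Average pairwise distance: mean cosine distance over all cross-period pairs."""
    return float(cosine_distance_matrix(A, B).mean())


def sinkhorn_cost(A, B, reg=0.1, n_iter=500):
    """Transport cost <P, C> of an entropically regularized OT plan.

    Assumes uniform weights on the usages of each period (an illustrative choice).
    """
    C = cosine_distance_matrix(A, B)
    n, m = C.shape
    a = np.full(n, 1.0 / n)   # uniform mass on period-1 usages
    b = np.full(m, 1.0 / m)   # uniform mass on period-2 usages
    K = np.exp(-C / reg)      # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):   # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((P * C).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    period_1 = rng.normal(size=(20, 8))   # synthetic stand-ins for embeddings
    period_2 = rng.normal(size=(25, 8))
    print("APD:", apd(period_1, period_2))
    print("OT cost, strong regularization:", sinkhorn_cost(period_1, period_2, reg=1e3))
    print("OT cost, weak regularization:", sinkhorn_cost(period_1, period_2, reg=0.05))
```

Under these assumptions, a very large `reg` drives the transport plan toward the uniform coupling, so the cost approaches APD, while a small `reg` approaches the unregularized Wasserstein cost. This is one way to read the claim that regularized OT sits between the two measures.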
