Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance

Marko Pranjić, Kaja Dobrovoljc, Senja Pollak, Matej Martinc

TL;DR

This work tackles semantic-change detection in Slovene by creating the first gold-standard dataset, with 3,150 sentence pairs across two periods and 104 target words, and by proposing a regularized optimal-transport (OT) distance to quantify diachronic shifts. It systematically compares static, contextual, clustering, and OT-based measures, showing that entropically regularized OT often yields the strongest agreement with human annotations, especially when using layer $L_{11}$ of SloBERTa. The study also shows that a simplified formulation of the average pairwise distance (APD) reduces computational complexity while making its limitations explicit, and that regularized OT can unify insights from APD and the Wasserstein distance (WD). Overall, the results advance robust semantic-change assessment in a low-resource language, and the accompanying data and code release supports future research.

Abstract

In this paper, we focus on the detection of semantic changes in Slovene, a less-resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insight into the evolution of language caused by changes in society and culture. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We analyze an important class of semantic change measures based on the average pairwise distance (APD) and identify several limitations. To address these limitations, we propose a novel metric based on regularized optimal transport, which offers a more robust framework for quantifying semantic change. We provide a comprehensive evaluation of various existing semantic change detection methods and associated semantic change measures on our dataset. Through empirical testing, we demonstrate that our proposed approach, leveraging regularized optimal transport, matches or improves on the performance of baseline approaches.

Paper Structure

This paper contains 19 sections, 10 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Effect of entropic regularization on optimal transport results. The first subfigure shows the optimal transport plan without regularization, the middle one uses small regularization ($\lambda=0.7$), and the last one uses a very high value ($\lambda=100$), showing near-equivalence with APD.
  • Figure 2: Analysis of the magnitudes of hidden state representations across layers. For each hidden layer, we collect target word representations and plot their Euclidean norms.
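
The behaviour described in the Figure 1 caption can be illustrated with a minimal sketch. It uses the standard Sinkhorn iteration for entropically regularized optimal transport with uniform weights and cosine distances; this is an assumption made for illustration, and the paper's exact formulation may differ. The function names, the synthetic embeddings, and the regularization values (`0.05` and `100.0`) are hypothetical choices for this toy data, not the paper's settings. With very strong regularization the transport plan spreads almost uniformly over all pairs, so its cost approaches the mean of all pairwise distances, which is what APD computes.

```python
import numpy as np


def cosine_distance_matrix(X, Y):
    """Pairwise cosine distances between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T


def apd(C):
    """Average pairwise distance: the mean over all cross-period pairs."""
    return float(C.mean())


def sinkhorn_cost(C, reg, n_iter=2000):
    """Transport cost <P, C> of the entropically regularized OT plan.

    Uniform weights on both sides (an assumption of this sketch);
    `reg` plays the role of the regularization strength lambda.
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-C / reg)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):       # alternate marginal scalings
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return float((P * C).sum()), P


# Synthetic stand-ins for contextual embeddings from two time periods.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 16))
Y = rng.normal(size=(40, 16)) + 0.5
C = cosine_distance_matrix(X, Y)

cost_small, P_small = sinkhorn_cost(C, reg=0.05)   # weak regularization
cost_large, P_large = sinkhorn_cost(C, reg=100.0)  # very strong regularization

print("APD:              ", apd(C))
print("OT cost, reg=0.05:", cost_small)
print("OT cost, reg=100: ", cost_large)
```

On this toy data the strongly regularized cost should sit very close to the APD value, while the weakly regularized cost is lower because the plan concentrates mass on the closest pairs.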