Tracking Semantic Change in Slovene: A Novel Dataset and Optimal Transport-Based Distance

Marko Pranjić; Kaja Dobrovoljc; Senja Pollak; Matej Martinc

TL;DR

This work tackles semantic change detection in Slovene by creating the first gold-standard dataset, with 3,150 sentence pairs across two periods and 104 target words, and by proposing a regularized optimal transport (OT) distance to quantify diachronic shifts. It systematically compares static, contextual, clustering, and OT-based measures, showing that entropically regularized OT often yields the strongest agreement with human annotations, especially when using layer $L_{11}$ of SloBERTa. The study also shows that a simplified formulation of the average pairwise distance (APD) reduces computational complexity and makes the limitations of APD explicit, and that regularized OT can unify insights from APD and the Wasserstein distance (WD). The dataset and code are released to support future research on semantic change in a low-resource language.

Abstract

In this paper, we focus on the detection of semantic changes in Slovene, a less-resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insight into the evolution of language caused by changes in society and culture. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We analyze an important class of semantic change measures based on the average pairwise distance (APD) and identify several limitations. To address these limitations, we propose a novel metric based on regularized optimal transport, which offers a more robust framework for quantifying semantic change. We provide a comprehensive evaluation of various existing semantic change detection methods and associated semantic change measures on our dataset. Through empirical testing, we demonstrate that our proposed approach, leveraging regularized optimal transport, matches or improves on the performance of baseline approaches.

Paper Structure (19 sections, 10 equations, 2 figures, 8 tables)

Introduction
Related Work
Evaluation of Semantic Change
Systems for Automatic Change Detection
Measuring Semantic Change
Semantic Change Detection in Slovene
Dataset Construction
Corpus Selection
Word List Creation
Annotation
Inter-Annotator Agreement
Semantic Change Scores and Observations
Semantic Change Detection Through Optimal Transport
Average Pairwise Distance Metric
Optimal Transport With Entropic Regularization
...and 4 more sections

Figures (2)

Figure 1: Effect of entropic regularization on optimal transport results. The first subfigure shows the optimal transport result without regularization, the middle one uses a small regularization ($\lambda=0.7$), and the last one uses a very high value ($\lambda=100$), showing near-equivalence with APD.
Figure 2: Analysis of the magnitudes of hidden state representations across layers. For each hidden layer, we collect target word representations and plot their Euclidean norms.
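
The near-equivalence with APD noted in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine distance as the ground cost, uniform weights over word usages, synthetic random vectors in place of SloBERTa embeddings, and the common convention in which $\lambda$ scales the entropy term, so the Sinkhorn kernel is $\exp(-C/\lambda)$. All function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(X, Y):
    """Pairwise cosine distances between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def apd(C):
    """Average pairwise distance: the mean of all entries of the cost matrix."""
    return float(C.mean())

def sinkhorn_cost(C, lam, n_iter=500):
    """Transport cost <P, C> of the entropically regularized plan P.

    Assumes uniform marginals and regularization strength lam (larger = smoother plan).
    """
    n, m = C.shape
    a = np.full(n, 1.0 / n)          # uniform weights, period 1 usages
    b = np.full(m, 1.0 / m)          # uniform weights, period 2 usages
    K = np.exp(-C / lam)             # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):          # standard Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]  # regularized transport plan
    return float((P * C).sum())

# Synthetic stand-ins for usage embeddings from two time periods (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 16))
Y = rng.normal(size=(25, 16)) + 0.5

C = cosine_distance_matrix(X, Y)
apd_value = apd(C)
cost_small = sinkhorn_cost(C, lam=0.7)    # small regularization
cost_large = sinkhorn_cost(C, lam=100.0)  # very high regularization

print(f"APD: {apd_value:.4f}")
print(f"Regularized OT cost, lambda=0.7: {cost_small:.4f}")
print(f"Regularized OT cost, lambda=100: {cost_large:.4f}")
```

With very high regularization the plan approaches the uniform coupling that assigns weight $1/(nm)$ to every pair, so the transport cost approaches the mean of the cost matrix, which is APD under these assumptions. With small regularization the plan concentrates on cheaper pairs and the cost drops below APD.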
