DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages

Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, Barbara McGillivray

TL;DR

The paper tackles the challenge of capturing diachronic, graded word meaning at scale by introducing DWUGs, a large multilingual resource built via two annotation paradigms that generate Usage-Usage Graphs (UUGs) and Usage-Sense Graphs (USGs). It combines multi-round, edge-efficient annotation with a robust clustering objective $L(C)$ to infer sense structure across English, German, Swedish, and Latin, yielding around 100k judgments across two time periods per language. Key contributions include the largest diachronic graded-meaning dataset to date, a detailed annotation pipeline with robustness analyses, and the public release of clusterings and visualizations to support evaluation of contextualized embeddings and semantic-change detection models. The work demonstrates practical value for evaluating and training diachronic NLP systems and shows how the annotation can be adapted for Latin, where native annotators are limited. Overall, the resource enables nuanced, time-aware semantic evaluation across multiple languages and provides a foundation for future improvements in clustering methods and annotation strategies.

Abstract

Word meaning is notoriously difficult to capture, both synchronically and diachronically. In this paper, we describe the creation of the largest resource of graded, contextualized, diachronic word meaning annotation in four different languages, based on 100,000 human semantic proximity judgments. We thoroughly describe the multi-round incremental annotation process, the choice of a clustering algorithm to group usages into senses, and possible diachronic and synchronic uses for this dataset.

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Usage-usage graphs of English plane (left), German ausspannen (middle) and Swedish ledning (right). Nodes represent usages of the respective target word. Edge weights represent the median of relatedness judgments between usages (black/gray lines for high/low edge weights, i.e., weights $\geq$ 2.5/weights $<$ 2.5).
  • Figure 2: Usage-usage graph of Swedish ledning (left), subgraph for first time period $C_1$ (middle) and second time period $C_2$ (right).
  • Figure 3: Usage-sense graphs of Latin pontifex (left), potestas (middle) and sacramentum (right). Nodes in blue/red represent usages/senses respectively.
  • Figure 4: Usage-sense graph of Latin sacramentum (left), subgraph for first time period $C_1$ (middle) and second time period $C_2$ (right).
  • Figure 5: Judgment frequency per language.
  • ...and 3 more figures
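
The edge-weight convention stated in the Figure 1 caption (each edge carries the median of the relatedness judgments for a usage pair, drawn as high for weights $\geq$ 2.5 and low for weights $<$ 2.5) can be illustrated with a minimal Python sketch. The usage identifiers, the judgment values, and the function names below are made up for illustration and are not taken from the dataset or its released code.

```python
from statistics import median

# Threshold separating high from low edge weights, as stated in the Figure 1 caption.
HIGH_WEIGHT_THRESHOLD = 2.5


def edge_weights(judgments):
    """Map each usage pair to the median of its relatedness judgments."""
    return {pair: median(scores) for pair, scores in judgments.items()}


def split_edges(weights, threshold=HIGH_WEIGHT_THRESHOLD):
    """Split edges into high (weight >= threshold) and low (weight < threshold)."""
    high = {pair: w for pair, w in weights.items() if w >= threshold}
    low = {pair: w for pair, w in weights.items() if w < threshold}
    return high, low


# Hypothetical judgments: usage pair -> scores given by different annotators.
judgments = {
    ("u1", "u2"): [4, 3, 4],
    ("u1", "u3"): [1, 2],
    ("u2", "u3"): [2, 3],
}

weights = edge_weights(judgments)
high, low = split_edges(weights)
print("high:", sorted(high))
print("low:", sorted(low))
```

Using the median rather than the mean keeps a single outlying judgment from moving an edge across the threshold; a pair whose median is exactly 2.5 counts as a high edge under the caption's convention.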