
Analyzing Continuous Semantic Shifts with Diachronic Word Similarity Matrices

Hajime Kiyama, Taichi Aida, Mamoru Komachi, Toshinobu Ogiso, Hiroya Takamura, Daichi Mochihashi

TL;DR

This paper proposes a diachronic word similarity matrix framework to analyze semantic shifts across arbitrary time periods, addressing limitations of adjacent-period change detection and computationally expensive sense-distribution methods. By aligning embeddings over time and computing a diachronic similarity matrix $S(w)\in\mathbb{R}^{T\times T}$ for each word, then clustering these matrices across words, the method reveals multi-period shift dynamics and groups words with similar trajectories without assuming predefined shift types. The authors validate the approach on real English corpora (COHA, COCA) and a Japanese corpus (Mainichi Shimbun) using $100$-dimensional PPMI-SVD joint embeddings, showing interpretable visualizations, PPMI-based interpretability via $\Delta M^{(t_2\rightarrow t_1)}$, and unsupervised clustering that captures sociolinguistic factors. They further test the framework on pseudo data with seven shift schemas, demonstrating competitive classification performance and highlighting the importance of cosine similarity and feature choice, particularly upper triangular components and standardization. Overall, the approach enables scalable, multi-period semantic-shift analysis and offers practical tools for linguists and NLP practitioners to detect, interpret, and cluster shifts across long temporal spans.
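
To make the central object concrete, the matrix can be written out as follows. This is a minimal formalization under our own notation, not necessarily the paper's: $\mathbf{w}^{(t)}$ is assumed to denote the aligned embedding of word $w$ in period $t$, and cosine similarity is used because the summary names it as the similarity measure.

$$
S(w)_{ij} = \cos\left(\mathbf{w}^{(t_i)}, \mathbf{w}^{(t_j)}\right) = \frac{\mathbf{w}^{(t_i)} \cdot \mathbf{w}^{(t_j)}}{\lVert \mathbf{w}^{(t_i)} \rVert \, \lVert \mathbf{w}^{(t_j)} \rVert}, \qquad 1 \le i, j \le T.
$$

Under this definition $S(w)$ is symmetric with ones on the diagonal, so all of its information sits in the $T(T-1)/2$ entries with $i < j$. That is one plausible reason why upper triangular components work well as features.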

Abstract

The meanings and relationships of words shift over time, a phenomenon referred to as semantic shift. Understanding how semantic shifts unfold over multiple time periods is essential for analyzing them in detail. However, detecting change points only between adjacent time periods is insufficient for such analysis, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address these issues, we propose a simple yet intuitive framework for analyzing how semantic shifts occur over multiple time periods by leveraging a similarity matrix between the embeddings of the same word through time. We compute a diachronic word similarity matrix using fast and lightweight word embeddings across arbitrary time periods, enabling deeper analysis of continuous semantic shifts. Additionally, by clustering the similarity matrices of different words, we can categorize words that exhibit similar semantic-shift behavior in an unsupervised manner.
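
The first step of the framework, building one similarity matrix per word, can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the function names, the toy vectors, and the choice to drop the diagonal when flattening are all assumptions, and the per-period embeddings are assumed to be already aligned into a common space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors_by_period):
    """Build the T x T cosine similarity matrix for a single word.

    vectors_by_period[t] is the (already aligned) embedding of the word
    in period t.
    """
    T = len(vectors_by_period)
    return [
        [cosine(vectors_by_period[i], vectors_by_period[j]) for j in range(T)]
        for i in range(T)
    ]

def upper_triangle(matrix):
    """Flatten the entries with i < j into a feature vector.

    Assumption: the diagonal is skipped because it is always 1.
    """
    T = len(matrix)
    return [matrix[i][j] for i in range(T) for j in range(i + 1, T)]

# Toy data (hypothetical): a word whose embedding changes abruptly
# between the second and third period.
toy_vectors = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

S = similarity_matrix(toy_vectors)
features = upper_triangle(S)

for row in S:
    print(row)
print(features)
```

In the toy matrix the abrupt change shows up as two blocks of high similarity along the diagonal, which is the kind of pattern the framework reads off visually. The flattened vector is what a downstream clustering step would consume.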

Paper Structure

This paper contains 38 sections, 3 equations, 16 figures, and 7 tables.

Figures (16)

  • Figure 1: A framework for analysing diachronic semantic shifts using similarity matrices: 1. Calculate the word similarity matrix for a target word using word embeddings trained for each period. 2. Perform analyses such as clustering on the similarity matrices for all words.
  • Figure 2: Visualization of the diachronic word cosine similarity matrices for the words "record" and "president", computed with PPMI-SVD joint. Clusters and spikes across time are clearly visible, indicating that two types of semantic shift (linguistic and social) are successfully represented.
  • Figure 5: Visualizing the similarity matrices of all words in COHA in two dimensions using t-SNE shows that words close to each other in the compressed space exhibit similar similarity patterns.
  • Figure 6: Visualization of clusters containing target words for each dataset was performed using hierarchical clustering for all words. This method allows us to observe how clusters that share a similar time-series pattern merge, providing insights into the clustering process and the relationships between words within the dataset.
  • Figure 9: Illustration of seven schemas for inserting pseudowords into the synthetic dataset (Shoemark et al., 2019). The orange line represents $sense_1$, the black dotted line represents $sense_2$, and the other lines correspond to the remaining $senses$.
  • ...and 11 more figures
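
The hierarchical clustering mentioned in the caption of Figure 6 can be illustrated with a naive agglomerative procedure over flattened similarity-matrix features. This is a sketch under stated assumptions, not the paper's setup: average linkage and Euclidean distance are our choices, since the linkage criterion is not given here, and the word labels and feature values below are invented toy data.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(features, num_clusters):
    """Naive agglomerative clustering with average linkage.

    features maps each word to its feature vector. Clusters are merged
    pairwise, closest first, until num_clusters remain.
    """
    clusters = [[word] for word in features]

    def cluster_distance(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(features[a], features[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical upper-triangle features for four toy words over T = 4 periods:
# two with an abrupt mid-span change, two that stay stable.
toy_features = {
    "abrupt_a": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    "abrupt_b": [0.9, 0.1, 0.0, 0.1, 0.0, 0.9],
    "stable_a": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "stable_b": [0.9, 0.9, 0.8, 0.9, 0.9, 0.9],
}

clusters = average_linkage(toy_features, num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

On this toy input the two abrupt-change words and the two stable words end up in separate clusters. The quadratic search over cluster pairs is kept for readability; a real vocabulary would call for an optimized library routine.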