Historical Ink: Semantic Shift Detection for 19th Century Spanish

Tony Montes; Laura Manrique-Gómez; Rubén Manrique

TL;DR

This work targets semantic shift detection (SSD) in 19th-century Spanish, emphasizing Latin American varieties. It builds a large old-Spanish corpus $C_{old}$ (1800–1914) and a modern reference corpus $C_{new}$, then implements a modular SSD pipeline that combines retrieval of word occurrences, embeddings from BERT-like models fine-tuned with masked language modeling ($MLM$) on corpus-specific data, and diachronic clustering into Diachronic Word Usage Graphs (DWUGs). The methodology clusters the occurrences from both corpora jointly with Affinity Propagation (AP) and KMeans, evaluates multiple Spanish language models (notably BETO fine-tuned on Latin American data), and measures shifts via Cosine Distance $f_{CD}$ and Inverted similarity over Word Prototype $f_{PRT}$ across senses $\Psi_{w,s,t}$. Key findings include that AP is most effective on single-sense words and KMeans on multi-sense words, along with concrete historical-semantic insights such as the shifts of “mujeres” (women) and “infancia” (childhood), which reflect 19th-century social discourse. The approach contributes to digital humanities by providing a scalable, reproducible framework for SSD in historical multilingual contexts, and it shows how linguistic shifts encode cultural and political transformations, with potential applications in linguistics, history, and sociology.

Abstract

This paper explores the evolution of word meanings in 19th-century Spanish texts, with an emphasis on Latin American Spanish, using computational linguistics techniques. It addresses the Semantic Shift Detection (SSD) task, which is crucial for understanding linguistic evolution, particularly in historical contexts. The study analyzes a set of Spanish target words. To do so, it constructs a 19th-century Spanish corpus and develops a customizable pipeline for SSD tasks. The pipeline finds the senses of a word and measures their semantic change between two corpora, using BERT-like models fine-tuned on old Spanish texts for both the Latin American and the general Spanish cases. The results provide insights into the cultural and societal shifts reflected in language change over time.
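
The two shift measures named in the TL;DR, $f_{CD}$ and $f_{PRT}$, can be illustrated with a minimal sketch. The formulas below are assumptions based on the common definitions of these measures (cosine distance between the averaged "prototype" embeddings of each period, and the inverse of their cosine similarity); the summary above does not state the exact formulas used in the paper, and the function names `f_cd` and `f_prt`, the helper names, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def prototype(embeddings: Sequence[Vector]) -> List[float]:
    """Word prototype: the element-wise mean of the occurrence embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def f_cd(old: Sequence[Vector], new: Sequence[Vector]) -> float:
    """Assumed form of f_CD: cosine distance between the two prototypes."""
    return 1.0 - cosine_similarity(prototype(old), prototype(new))


def f_prt(old: Sequence[Vector], new: Sequence[Vector]) -> float:
    """Assumed form of f_PRT: inverse of the prototypes' cosine similarity.

    Raises ZeroDivisionError if the two prototypes are orthogonal.
    """
    return 1.0 / cosine_similarity(prototype(old), prototype(new))


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" of one word in each corpus (made up).
    old_occurrences = [[1.0, 0.0], [1.0, 0.0]]
    new_occurrences = [[0.0, 1.0], [1.0, 1.0]]
    print("f_CD :", f_cd(old_occurrences, new_occurrences))
    print("f_PRT:", f_prt(old_occurrences, new_occurrences))
```

Under these assumed definitions, identical occurrence sets give $f_{CD} = 0$ and $f_{PRT} = 1$, and both values grow as the old and modern prototypes diverge. In a real pipeline the vectors would come from the fine-tuned model's contextual embeddings, possibly restricted to the occurrences of a single sense cluster.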

Paper Structure (16 sections, 4 equations, 7 figures, 4 tables)

Introduction
Related Work
Data
Cleaning
Chunking
Methodology
Find the Occurrences
Word Embeddings
Clustering
Semantic Shift Measurement
Evaluation and Model Selection
Results
Acknowledgements
Usage Examples per Sense
SSD Examples
...and 1 more sections

Figures (7)

Figure 1: Final corpus distribution by source. The percentage is computed over the total number of rows of the whole $C_{old}$ chunked corpus
Figure 2: Final corpus distribution by decade. The percentage is computed over the total number of rows of the whole $C_{old}$ chunked corpus
Figure 3: Historical Ink SSD Pipeline Architecture
Figure 4: DWUG of the word "mujeres" (women), using the whole corpus fine-tuned model embeddings, the T-SNE dimensionality reduction algorithm, and the KMeans clustering algorithm (with the silhouette metric). Each color represents a meaning (cluster) of the word. The color changes between the left (old corpus) and center (modern corpus) images illustrate the overall semantic change between the two diachronic corpora.
Figure 5: Diachronic comparison of the word "revolución" (revolution) and its related words between the old and the modern periods, using the PCA dimensionality reduction algorithm.
...and 2 more figures
