Turkronicles: Diachronic Resources for the Fast Evolving Turkish Language

Togay Yazar, Mucahid Kutlu, İsa Kerem Bayırlı

TL;DR

Turkronicles provides the first large-scale diachronic Turkish resource combining Official Gazette and parliamentary records to analyze changes in vocabulary and writing conventions since the 1920s. By building a diachronic pipeline with preprocessing, a modern-old Turkish dictionary, n-grams, and embeddings (PPMI, SVD, CBOW) aligned via Orthogonal Procrustes, the work reveals substantial vocabulary turnover and a shift away from Arabic/Persian lexicon toward Turkish-origin terms, along with declining circumflex usage and changes in word-final consonants. The accompanying Lingan Python library enables reproducible, extensible diachronic analyses. The study demonstrates concrete patterns, such as vocabularies diverging more as the time gap between periods grows and old terms being replaced by new equivalents, underscoring the impact of state-driven language reform. The dataset and tools have broad implications for historical linguistics, sociolinguistics, and NLP in Turkish, and the authors plan to extend the corpus with additional sources and broader access to foster further research.

Abstract

Over the past century, the Turkish language has undergone substantial changes, primarily driven by governmental interventions. In this work, our goal is to investigate the evolution of the Turkish language since the establishment of Türkiye in 1923. To this end, we first introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye. Turkronicles contains 45,375 documents detailing governmental actions, making it a pivotal resource for analyzing linguistic evolution influenced by state policies. In addition, we expand an existing diachronic Turkish corpus, which consists of the records of the Grand National Assembly of Türkiye, by covering additional years. Next, combining these two diachronic corpora, we seek answers to two main research questions: how have (i) the Turkish vocabulary and (ii) the writing conventions changed since the 1920s? Our analysis reveals that the vocabularies of two time periods diverge more as the time between them increases, and that newly coined Turkish words take the place of their old counterparts. We also observe changes in writing conventions. In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are progressively replaced with forms ending in "-p" and "-t", respectively. Overall, this study quantitatively highlights the dramatic changes in Turkish across various aspects of the language from a diachronic perspective.
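
The two writing-convention measures mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline or the Lingan API: the function names, the token-level definition of "circumflex usage", and the toy tokens are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter

# Circumflexed vowels used in Turkish orthography (precomposed forms).
CIRCUMFLEX = set("âîûÂÎÛ")

def circumflex_rate(tokens):
    """Fraction of tokens containing at least one circumflexed vowel."""
    if not tokens:
        return 0.0
    # NFC normalization folds "a" + combining circumflex into "â".
    normalized = [unicodedata.normalize("NFC", t) for t in tokens]
    hits = sum(1 for t in normalized if any(ch in CIRCUMFLEX for ch in t))
    return hits / len(tokens)

def final_letter_shares(tokens, letters=("b", "p", "d", "t")):
    """Share of non-empty tokens ending in each of the given letters."""
    counts = Counter(t[-1].lower() for t in tokens if t)
    total = sum(counts.values())
    return {l: (counts[l] / total if total else 0.0) for l in letters}

# Toy tokens invented for illustration; not drawn from the corpora.
periods = {
    "1930s": ["kitab", "hâkim", "evlâd", "ve", "mektub"],
    "2010s": ["kitap", "hakim", "evlat", "ve", "mektup"],
}

for name, toks in periods.items():
    print(name, circumflex_rate(toks), final_letter_shares(toks))
```

On real data, these rates would be computed per time period over the full token stream, so that a falling circumflex rate or a shift from "-b"/"-d" to "-p"/"-t" endings shows up as a trend across periods.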

Paper Structure (24 sections, 2 equations, 11 figures, 5 tables)


Figures (11)

  • Figure 1: An example of a hierarchical tree structure consisting of Corpus and DiachronicCorpus objects. Nodes labeled D and C represent DiachronicCorpus and Corpus objects, respectively. $D_2$ contains two Corpus objects, $C_3$ and $C_4$. $C_1$ and $D_2$ together compose $D_1$.
  • Figure 2: An example usage of Lingan. This code snippet computes the relative frequency of belge (document) across time periods using the predefined Frequency function.
  • Figure 3: Defining a new Data component to model sentences in the corpus.
  • Figure 4: Defining a new Operation to calculate the total number of sentences on a diachronic corpus structure.
  • Figure 5: The number of unique lemmas/stems for each 10-year time period.
  • ...and 6 more figures
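
The relative-frequency computation described in the Figure 2 caption can be sketched in plain Python. This does not reproduce Lingan's Frequency function or its interface: `relative_frequency` and the toy data are hypothetical and only show the underlying arithmetic (occurrences of a word divided by the total number of tokens in each period).

```python
from collections import Counter

def relative_frequency(word, tokens_by_period):
    """Map each period to count(word) / total tokens in that period."""
    result = {}
    for period, tokens in tokens_by_period.items():
        counts = Counter(tokens)
        result[period] = counts[word] / len(tokens) if tokens else 0.0
    return result

# Toy tokens invented for illustration; not drawn from the corpora.
toy = {
    "1920s": ["vesika", "ve", "kanun", "vesika"],
    "2010s": ["belge", "ve", "yasa", "belge", "belge"],
}

print(relative_frequency("belge", toy))  # {'1920s': 0.0, '2010s': 0.6}
```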