Syntactic Language Change in English and German: Metrics, Parsers, and Convergences

Yanran Chen, Wei Zhao, Anne Breitbarth, Manuel Stoeckel, Alexander Mehler, Steffen Eger

TL;DR

This study investigates diachronic syntactic change in English and German by analyzing parliamentary debates over roughly 160 years with five parsers and 15 metrics related to dependency distance minimization and tree-graph properties. It reveals that parser choice materially affects observed trends and demonstrates a general convergence between English and German across most metrics, with German occasionally varying more. The work introduces a robust, multi-parser framework and three evaluation domains (UD treebanks, target treebanks, adversarial treebanks) to assess reliability and uses a majority-vote approach to stabilize trend detection. The findings show that significant syntactic changes cluster at sentence-length tails and that many non-distance metrics align across languages, providing a comprehensive, modern NLP perspective on historical syntax with practical implications for diachronic NLP tasks. Together, these results caution against relying on a single parser for syntactic-change studies and highlight subtle cross-language patterns in long-span syntactic evolution.
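The majority-vote idea mentioned above can be illustrated with a minimal sketch. It assumes each parser has already produced one trend label per metric; the labels ("up", "down", "none"), the function name `majority_trend`, and the strict-majority rule with an "undecided" fallback are hypothetical choices for illustration, not necessarily the procedure used in the paper.

```python
from collections import Counter


def majority_trend(labels):
    """Return the trend label supported by a strict majority of parsers.

    `labels` holds one (assumed) trend label per parser, e.g. "up",
    "down", or "none". If no label is backed by more than half of the
    parsers, the result is "undecided".
    """
    if not labels:
        return "undecided"
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else "undecided"


# Five hypothetical parser outputs for one metric.
print(majority_trend(["up", "up", "down", "up", "none"]))
print(majority_trend(["up", "down", "none", "up", "down"]))
```

With five parsers, a label needs at least three votes to count as the detected trend, so a single disagreeing parser cannot flip the conclusion.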

Abstract

Many studies have shown that human languages tend to optimize for lower complexity and increased communication efficiency. Syntactic dependency distance, which measures the linear distance between syntactically dependent words, is often considered a key indicator of language processing difficulty and working memory load. The current paper examines diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years. We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives. Our analysis of syntactic language change goes beyond linear dependency distance and explores 15 metrics relevant to dependency distance minimization (DDM) and/or based on tree graph properties, such as tree height and degree variance. Even though we have evidence that recent parsers trained on modern treebanks are not heavily affected by data 'noise' such as spelling changes and OCR errors in our historical data, we find that results on syntactic language change are sensitive to the parsers involved, which cautions against using a single parser for evaluating syntactic language change, as done in previous work. We also show that syntactic language change over the period investigated is largely similar between English and German for the different metrics explored: only 4% of the cases we examine yield opposite conclusions regarding upward and downward trends of syntactic metrics across German and English. We also show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions. To the best of our knowledge, ours is the most comprehensive analysis of syntactic language change using modern NLP technology in recent corpora of English and German.
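To make the kinds of metrics named in the abstract concrete, the following minimal sketch computes mean dependency distance, tree height, and degree variance for a single sentence. The head indices are an assumed UD-style parse of "But there is no proof." written by hand for illustration. The function names and the exact definitions (height counted in edges, population variance of undirected node degrees) are also assumptions and may differ from the paper's definitions.

```python
from statistics import mean, pvariance


def mean_dependency_distance(heads):
    """Average |dependent position - head position| over non-root tokens.

    `heads[i - 1]` is the 1-based head index of token i; 0 marks the root.
    """
    dists = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return mean(dists) if dists else 0.0


def tree_height(heads):
    """Number of edges on the longest path from the root to any token."""
    def depth(i):
        d = 0
        while heads[i - 1] != 0:  # climb towards the root
            i = heads[i - 1]
            d += 1
        return d
    return max(depth(i) for i in range(1, len(heads) + 1))


def degree_variance(heads):
    """Population variance of node degrees in the undirected tree."""
    degree = [0] * len(heads)
    for i, h in enumerate(heads, start=1):
        if h != 0:
            degree[i - 1] += 1  # edge counted at the dependent
            degree[h - 1] += 1  # and at its head
    return pvariance(degree)


# Tokens: But(1) there(2) is(3) no(4) proof(5) .(6)
# Assumed parse: "is" is the root; "no" attaches to "proof".
heads = [3, 3, 0, 5, 3, 3]
print(mean_dependency_distance(heads))  # distances 2, 1, 1, 2, 3
print(tree_height(heads))
print(degree_variance(heads))
```

Under this assumed parse the five dependency distances average 1.8 and the tree has height 2. A different parser could attach tokens differently and yield different values, which is the parser sensitivity the abstract warns about.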

Paper Structure

This paper contains 49 sections, 1 equation, 21 figures, 6 tables.

Figures (21)

  • Figure 1: Dependency relations of the sentence "But there is no proof." in linear order (left) and as a tree graph (right).
  • Figure 2: 4-step pipeline for sentence extraction from the corpora: 1. paragraph-level preprocessing, 2. sentence segmentation with spaCy, 3. postprocessing, 4. filtering.
  • Figure 3: Distribution of the texts identified as perfect sentences (without any issues), sentences with issues, and non-sentences over time.
  • Figure 4: (a)/(c): Percentage of perfect sentences and of sentences with a specific issue among all texts identified as sentences. (b): Percentage of sentences with issues from a specific origin among all sentences containing issues.
  • Figure 5: Average sentence length per decade group over time.
  • ...and 16 more figures