Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts

Paulo J. N. Pinto, Armando J. Pinho, Diogo Pratas

TL;DR

The paper addresses the dating of historical English texts with an interpretable framework that combines five feature families (compression-based signals, lexical/structural features, readability metrics, neologism detection, and function-word distance patterns) in tree-based models. Combining these signals yields 76.7% accuracy at century level and 26.1% at decade level, against random baselines of 20% and 2.3%, and rises to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. Explainability comes from feature importance and SHAP analyses, which point to the 19th century as a pivot across feature domains. The approach uses a large multisource corpus (Open Library and Project Gutenberg) and comes with a public software release. Cross-dataset evaluation on Project Gutenberg shows a 26.4 percentage-point accuracy drop, which indicates domain-adaptation challenges; the authors nonetheless present the method as a scalable, interpretable alternative to neural methods for historical text dating and related authentication tasks in digital humanities workflows.

Abstract

Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.
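The abstract reports three kinds of numbers: accuracy against a random baseline, top-k accuracy, and mean absolute deviation in years. The sketch below shows how such metrics are commonly computed. It is not the authors' evaluation code; the function names and the toy scores are hypothetical, and the chance baseline assumes balanced classes with uniform guessing (five century classes would give the quoted 20%).

```python
from typing import Sequence


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def mean_absolute_deviation(pred_years: Sequence[float],
                            true_years: Sequence[float]) -> float:
    """Average absolute gap, in years, between predicted and true dates."""
    gaps = [abs(p - t) for p, t in zip(pred_years, true_years)]
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): three texts scored over five century classes.
toy_scores = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.10],
    [0.05, 0.05, 0.20, 0.30, 0.40],
]
toy_labels = [1, 0, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
baseline = chance_accuracy(5)
mad = mean_absolute_deviation([1850, 1720], [1880, 1700])
```

On the toy data only the first text is ranked correctly at top-1, while all three true classes fall within the top two, which mirrors how relaxing temporal precision raises the reported scores.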

Paper Structure

This paper contains 48 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Compression features classification: a) Tree-based importance, b) Permutation importance, c) Feature correlations.
  • Figure 2: Lexical structure features in centuries classification: a) Tree-based importance, b) Permutation importance, c) Feature correlations.
  • Figure 3: Distance features in centuries classification: a) Tree-based importance, b) Permutation importance, c) Feature correlations.
  • Figure 4: Neologism features in centuries classification: a) Tree-based importance, b) Permutation importance, c) Feature correlations.
  • Figure 5: Model accuracy for century (left heatmap) and decade (right heatmap) classification on the test dataset.
  • ...and 2 more figures