Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change

Ananth Hariharan, David Mortensen

TL;DR

This work introduces a weakly supervised neural-symbolic framework for quantifying diachronic linguistic change in Sanskrit, a morphologically rich, low-resource language. Regex-based pseudo-labels are used to fine-tune a multilingual BERT, and a confidence-weighted ensemble fuses the neural and symbolic signals. Applied to a 1.47-million-word corpus spanning 2,000+ years, the approach achieves a 52.4% overall feature detection rate with well-calibrated uncertainty (Pearson r = 0.92, ECE = 0.043), and it challenges the notion of inevitable simplification by showing that morphological complexity is redistributed rather than lost. Key findings include stable overall detection across periods, substantial increases in compounding and philosophical vocabulary, and notable persistence of archaic features, all quantified through statistical analyses (OLS, Spearman, PCA) and validated against expert gold standards. The work establishes a reproducible framework for AI-augmented historical linguistics and argues for its applicability to other morphologically rich, data-sparse languages.

Abstract

This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit's overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
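
For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of how expected calibration error (ECE) is commonly computed: predictions are grouped into confidence bins, and the gap between mean confidence and observed accuracy in each bin is averaged, weighted by bin size. The function name, the choice of ten equal-width bins, and the toy inputs are assumptions made for illustration; the abstract does not specify the binning scheme the authors used, and this is not their implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: sum over bins of (|B| / N) * |acc(B) - conf(B)|.

    confidences: predicted confidences in [0, 1]
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width bins (assumed here; not from the paper)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to an equal-width bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical, not from the paper): four predictions at 0.9 confidence,
# three of them correct, so the single occupied bin has a gap of |0.75 - 0.9|.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A lower value means that stated confidence tracks observed accuracy more closely; a perfectly calibrated system would score 0 under this definition.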

Paper Structure

This paper contains 21 sections, 11 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Agreement Analysis
  • Figure 2: Feature Trends
  • Figure 3: Trend of Subjunctive Cases