Pretraining Language Models for Diachronic Linguistic Change Discovery

Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner

TL;DR

This work applies domain-restricted pretraining to diachronic linguistics so that each model's knowledge is confined to a single time period. It builds five 10-million-word time slices via a date-attribution pipeline and trains two five-model batteries, with one model per slice in each: pretrained BabyLlama-2-style models and DoRA adapters finetuned on Llama3-8B. The study finds that the pretrained models train faster and better preserve historical divisions, while the finetuned models achieve higher BLiMP performance but risk leakage across time periods. The speed and temporal precision of the pretrained models enable new diachronic analyses. The resulting pipeline supports automated hypothesis discovery about lexical, grammatical, and sense-change phenomena, and may extend to other fields that require boundary-aware historical analysis.

Abstract

Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining -- typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for "typical" LLM approaches. We employ a novel date-attribution pipeline to obtain a temporally segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments: one efficiently pretrained and one parameter-efficiently finetuned from Llama3-8B. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over ahistorical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.
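
One way to make "respecting the historical divisions" concrete is a cross-time perplexity matrix, in which each slice's model is scored on text from every slice. The sketch below is an illustration only, not the authors' pipeline: the function names are hypothetical, the log-probabilities are invented toy values, and it assumes per-token natural-log probabilities have already been obtained from each model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def cross_time_perplexities(logprobs):
    """Build the matrix where entry [i][j] is the perplexity of the
    slice-i model on slice-j text.

    logprobs[i][j] is a list of token log-probabilities (hypothetical input).
    """
    return [[perplexity(cell) for cell in row] for row in logprobs]


def respects_divisions(matrix):
    """Return True if every model's lowest perplexity falls on its own slice."""
    return all(
        min(range(len(row)), key=row.__getitem__) == i
        for i, row in enumerate(matrix)
    )


# Toy data for two slices (invented values, not results from the paper).
toy_logprobs = [
    [[-1.0, -1.2], [-2.0, -2.2]],  # slice-0 model on slice-0 / slice-1 text
    [[-2.1, -1.9], [-0.9, -1.1]],  # slice-1 model on slice-0 / slice-1 text
]
matrix = cross_time_perplexities(toy_logprobs)
diagonal_is_lowest = respects_divisions(matrix)
```

In this toy setting each model is least perplexed by its own slice. A model that had absorbed knowledge of other periods would tend to show a flatter row.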

Paper Structure

This paper contains 20 sections, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Cross-time perplexities
  • Figure 2: Model performance on the top 100 completion cloze task
  • Figure 3: Probability of Leakage, over pretrained and finetuned models.
  • Figure 4: Natural appearances of "station" with a descending probability trajectory and manually labelled for sense.
  • Figure 5: Count of cloze tasks per time slice for the set filtered for our data (14.6 thousand examples).
  • ...and 3 more figures