A ripple in time: a discontinuity in American history

Alexander Kolpakov, Igor Rivin

TL;DR

The paper tackles the challenge of extracting temporal structure and authorship signals from a small historical text corpus whose documents vary widely in length (State of the Union addresses). It combines transformer-based embeddings (GPT-2, DistilBERT) with nonlinear dimension reduction (UMAP, TriMAP, PaCMAP) to reveal latent temporal patterns and to perform authorship attribution without heavy fine-tuning. Key findings include author attribution accuracy of roughly 90–95% and year estimates accurate to within about one presidential term, along with a pronounced historical ripple around the late 1920s that appears across methods. The work demonstrates that modern embedding and clustering techniques can uncover meaningful historical patterns in small corpora, and that the results are readily reproducible on standard hardware using public code and data.

Abstract

In this technical note we suggest a novel approach to discovering temporal (related and unrelated to language dilation) and personality (authorship attribution) aspects in historical datasets. We exemplify our approach on the State of the Union addresses given by the past 42 US presidents: this dataset is known for its relatively small amount of data and high variability in the size and style of texts. Nevertheless, we manage to achieve about 95% accuracy on the authorship attribution task, and to pin down the date of writing to within a single presidential term.

Paper Structure (15 sections, 8 figures)

Figures (8)

  • Figure 1: UMAP visualizations of the GPT-2 embedding of SOTU
  • Figure 2: Temporal clustering of SOTU embeddings
  • Figure 3: 2D visualizations of the GPT-2 embedding of SOTU
  • Figure 4: 3D visualizations of the GPT-2 embedding of SOTU
  • Figure 5: Temporal clustering of SOTU addresses: GPT-2 embedding followed by a dimension reduction technique
  • ...and 3 more figures