Temporal Sequencing of Documents

Michael Gervers, Gelila Tilahun

TL;DR

TempSeq presents an unsupervised framework to temporally rank undated historical documents by exploiting gradual changes in word usage through a local kernel GLM, with bandwidth as the key smoothing parameter. A simulated annealing search over document permutations, guided by per-word bandwidth statistics, yields the temporal ordering that maximizes overall smoothing, enabling automatic dating without labeled data. Evaluation on SOTU and DEEDS corpora shows substantial gains over random baselines and reveals informative terms that anchor temporal signals, though performance declines for very short texts. The method offers a practical tool for heritage texts and forged or misdated documents, with future work targeting broader time gaps and Anglo-Saxon material.

Abstract

We outline an unsupervised method for temporal rank ordering of sets of historical documents, namely American State of the Union Addresses and DEEDS, a corpus of medieval English property transfer documents. Our method relies upon effectively capturing the gradual change in word usage via a bandwidth estimate for the non-parametric Generalized Linear Models (Fan, Heckman, and Wand, 1995). The number of possible rank orders needed to search through for cost functions related to the bandwidth can be quite large, even for a small set of documents. We tackle this problem of combinatorial optimization using the Simulated Annealing algorithm, which allows us to obtain the optimal document temporal orders. Our rank ordering method significantly improved the temporal sequencing of both corpora compared to a randomly sequenced baseline. This unsupervised approach should enable the temporal ordering of undated document sets.
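
The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The cost below is a simple roughness proxy (the sum of squared changes in word proportions between neighbouring documents) standing in for the paper's bandwidth-based cost. The names `roughness`, `anneal_order`, and `abs_spearman`, the swap move, the geometric cooling schedule, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def roughness(order, props):
    """Proxy cost (assumption): summed squared change in word proportions
    between documents that are adjacent in the candidate order."""
    total = 0.0
    for a, b in zip(order, order[1:]):
        total += sum((x - y) ** 2 for x, y in zip(props[a], props[b]))
    return total


def anneal_order(props, steps=20000, t_start=1.0, t_end=1e-4, seed=0):
    """Simulated annealing over document permutations using swap moves."""
    rng = random.Random(seed)
    n = len(props)
    order = list(range(n))
    rng.shuffle(order)
    cost = roughness(order, props)
    best, best_cost = order[:], cost
    for k in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (k / (steps - 1))
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = roughness(order, props)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the rejected swap
    return best, best_cost


def abs_spearman(order):
    """Absolute Spearman correlation between estimated positions and true
    ranks, assuming documents are indexed 0..n-1 in true temporal order."""
    n = len(order)
    d2 = sum((pos - doc) ** 2 for pos, doc in enumerate(order))
    return abs(1.0 - 6.0 * d2 / (n * (n * n - 1)))


# Synthetic example: two words whose proportions drift linearly over 8 documents.
props = [[0.1 + 0.1 * d, 0.9 - 0.1 * d] for d in range(8)]
estimated, cost = anneal_order(props)
print(estimated, round(cost, 4), abs_spearman(estimated))
```

The absolute value is used in `abs_spearman` because a smoothness-based cost of this kind cannot distinguish a sequence from its reverse, so the order is recoverable only up to direction. Simulated annealing is a heuristic, so the returned order is the best one visited, not a guaranteed global optimum.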

Paper Structure (15 sections, 24 equations, 9 figures)


Figures (9)

  • Figure 1: Asterisks show the proportion of occurrences of the word Drug(s) in the SOTU corpus. The solid curve uses a larger bandwidth than the dashed curve. The horizontal dotted line corresponds to a very large bandwidth. The $x$-axis is date (time) and the $y$-axis is $\hat{\pi}_{w,h}(t)$.
  • Figure 2: Asterisks show the proportion of occurrences of the phrase Angl(ic)is in the DEEDS corpus. The solid curve uses a larger bandwidth than the dashed curve. The $x$-axis is date (time) and the $y$-axis is $\hat{\pi}_{w,h}(t)$.
  • Figure 3: Asterisks show the proportion of occurrences of the word de (of). The solid smoothed probability curve is flat across the date range. The $x$-axis is date (time) and the $y$-axis is $\hat{\pi}_{w,h}(t)$.
  • Figure 4: Box plots of $H_{\sigma(l)}$ (bandwidths) versus temporal orders for 100 randomly selected sets of ten documents. In every panel, the first two box plots show $H_{\sigma(l)}$ for sets of documents in randomly permuted and in unpermuted temporal order. The panels show results for the SOTU corpus and for the DEEDS corpus (DEEDS-conflated and DEEDS-single).
  • Figure 5: Box plots of the absolute correlation coefficients between the estimated and true rank orders for sets of 10 documents, replicated 100 times. The first plot corresponds to the State of the Union Address corpus (SOTU), the second to the DEEDS-conflated corpus, and the final plot is the baseline (random).
  • ...and 4 more figures
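
The bandwidth behaviour described in the captions for Figures 1 to 3 can be reproduced with a small sketch. It uses a local-constant (degree-zero) kernel estimate of a word's proportion, which is a simpler stand-in for the local polynomial GLM fit cited in the abstract. The function name `smoothed_proportion`, the Gaussian kernel, and the toy counts are assumptions for illustration and are not taken from the paper.

```python
import math


def smoothed_proportion(t, times, counts, totals, h):
    """Kernel-weighted proportion of a word at time t.

    times  : document dates
    counts : occurrences of the word in each document
    totals : total word count of each document
    h      : bandwidth of the Gaussian kernel (assumed kernel choice)

    Note: if h is tiny and t is far from every date, all weights can
    underflow to zero and the division below will fail.
    """
    weights = [math.exp(-0.5 * ((t - s) / h) ** 2) for s in times]
    num = sum(w * c for w, c in zip(weights, counts))
    den = sum(w * n for w, n in zip(weights, totals))
    return num / den


# Toy data (hypothetical): a word that becomes more frequent over time.
times = list(range(10))
counts = [1, 2, 3, 5, 8, 12, 15, 18, 20, 22]
totals = [100] * 10

for h in (1.0, 3.0, 1e6):
    curve = [smoothed_proportion(t, times, counts, totals, h) for t in times]
    print(h, [round(p, 3) for p in curve])
```

With a very large bandwidth every document receives nearly equal weight, so the estimate collapses to the pooled proportion and the curve becomes a horizontal line. Smaller bandwidths follow the local trend more closely.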