Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis

Hale Sirin; Sabrina Li; Tom Lippincott

TL;DR

This work addresses the detection of structured language alternations in historical multilingual documents, focusing on Armeno-Turkish. It introduces a workflow that computes language-probability signals over 50-word windows, converts these time-domain signals into the frequency domain via the discrete Fourier transform, and uses clustering to identify distinct alternation patterns. Empirical results on HathiTrust data reveal three pattern classes and uncover 30 new Armeno-Turkish records, while also highlighting OCR-related noise as a key challenge. The workflow is scalable and language-agnostic, can be used to extract monolingual segments, and informs resource-building for low-resource historical languages.
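The time-to-frequency step can be illustrated with a minimal sketch. This is not the authors' implementation. The signal is synthetic, standing in for per-window language probabilities, the function names `dft_magnitudes` and `dominant_period` are hypothetical, and the clustering step is omitted. It shows only how a regular alternation between two languages appears as a peak in the discrete Fourier transform.

```python
import cmath


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N//2 using a naive O(N^2) DFT.

    The mean is subtracted first so the constant (k = 0) component
    does not dominate the spectrum.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        mags.append(abs(coeff))
    return mags


def dominant_period(signal):
    """Return the alternation period, in windows, of the strongest non-zero frequency."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return len(signal) / k


# Synthetic example (an assumption, not data from the paper): the probability
# that a window is in the target language is high for 4 windows, then low for
# 4 windows, repeated 8 times, giving 64 windows in total.
signal = ([0.9] * 4 + [0.1] * 4) * 8
period = dominant_period(signal)
print(period)
```

A document with no regular alternation would have no comparably sharp peak, which is what makes the frequency-domain representation a plausible input for clustering documents by alternation pattern.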

Abstract

In this study, we present a generalizable workflow for identifying documents in Armeno-Turkish, a historical language with a nonstandard language and script combination. We introduce the task of detecting distinct patterns of multilinguality based on the frequency of structured language alternations within a document.

Paper Structure (11 sections, 1 equation, 6 figures, 1 table)

Introduction
Background
Language ID
Frequency Analysis
Materials and Methods
Data
Language ID Experimental Setup
Frequency Analysis
Results and Discussion
Error Analysis
Future Work

Figures (6)

Figure 1: The first page of the Ottoman legal code, Mejelle, published in 1889 in a bi-column bilingual format, Armenian on the left and Armeno-Turkish on the right.
Figure 2: Time domain and frequency domain representations of an alternating discrete signal.
Figure 3: Visualization of k-elbow inertia metric for optimal k in k-means clustering.
Figure 4: Time domain and frequency domain representations of the alternating language probability signal in a section of the monolingual book with page segmentation shown in Figure 5.
Figure 5: Page segmentation in the book, Commentary on the Gospel of Matthew, in Armeno-Turkish.
...and 1 more figure
