Detecting Structured Language Alternations in Historical Documents by Combining Language Identification with Fourier Analysis
Hale Sirin, Sabrina Li, Tom Lippincott
TL;DR
This work addresses the detection of structured language alternations in historical multilingual documents, focusing on Armeno-Turkish. It introduces a workflow that computes a language-probability signal over consecutive 50-word windows, transforms that time-domain signal into the frequency domain with the discrete Fourier transform, and clusters the results to identify distinct alternation patterns. Empirical results on HathiTrust data reveal three pattern classes and uncover 30 new Armeno-Turkish records, while also highlighting OCR-related noise as a key challenge. The approach offers a scalable, language-agnostic way to extract monolingual segments and informs resource-building for low-resource, historical languages.
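The signal-to-spectrum step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `alternation_spectrum` and `dominant_period`, the mean-centering, the normalization, and the synthetic probabilities are all assumptions made for illustration. It also assumes that a language identifier has already produced one probability per 50-word window, and it stops before the clustering step.

```python
import numpy as np

def alternation_spectrum(probs):
    """Normalized DFT magnitude spectrum of a per-window language-probability signal.

    `probs` is assumed to hold one probability per window (hypothetical input).
    """
    signal = np.asarray(probs, dtype=float)
    # Subtract the mean so the zero-frequency (DC) term does not dominate.
    centered = signal - signal.mean()
    magnitudes = np.abs(np.fft.rfft(centered))
    total = magnitudes.sum()
    # A (near-)constant signal carries no alternation: return all zeros.
    if total < 1e-9:
        return np.zeros_like(magnitudes)
    return magnitudes / total

def dominant_period(probs):
    """Period, in windows, of the strongest alternation; None for a flat signal."""
    spectrum = alternation_spectrum(probs)
    if not spectrum.any():
        return None
    k = int(np.argmax(spectrum[1:])) + 1  # skip the DC bin
    return len(probs) / k

# Synthetic signals (illustrative values, not data from the paper).
alternating = [0.9, 0.9, 0.1, 0.1] * 8  # language switches every 2 windows
monolingual = [0.9] * 32                # no alternation
```

Here `dominant_period(alternating)` is 4.0 windows, while `dominant_period(monolingual)` is `None`. Spectra of this kind could then serve as feature vectors for a clustering step, the details of which are not given in this summary.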
Abstract
In this study, we present a generalizable workflow to identify documents in Armeno-Turkish, a historical language with a nonstandard combination of language and script. We introduce the task of detecting distinct patterns of multilinguality based on the frequency of structured language alternations within a document.
