AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
Aleksandr Fedchin, Isabel Cooperman, Pramit Chaudhuri, Joseph P. Dexter
TL;DR
AcrosticSleuth detects intentional acrostics in large multilingual corpora by framing identification as a binary classification task and ranking each candidate sequence $s$ by the likelihood ratio $P(s \mid a)/P(s \mid \neg a)$, where $a$ denotes the event that $s$ is an intentional acrostic; the ratio serves as an approximation that holds when the prior $P(a)$ is small. The approach uses a SentencePiece unigram language model to estimate $P(s \mid a)$, ranks candidates efficiently with dynamic programming, and scales to large corpora through multithreading. The authors introduce the multilingual AcrostID dataset, derived from WikiSource in French, English, and Russian, on which AcrosticSleuth achieves F1 scores of $0.39$, $0.59$, and $0.66$, respectively; the tool also uncovers previously unknown acrostics such as ARSPOETICA and THOMAS[OF]HOBBES. The work contributes an open, scalable tool and a labeled dataset, which together enable the automated study of wordplay across languages and prepare the ground for extension to other forms of wordplay.
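The role of the small-$P(a)$ approximation can be made explicit with Bayes' rule. The following is a sketch, assuming $a$ denotes the event that a candidate sequence $s$ is an intentional acrostic and that the prior $P(a)$ is the same for every candidate; it is not necessarily the exact derivation used by the authors.

$$
\frac{P(a \mid s)}{P(\neg a \mid s)} = \frac{P(s \mid a)}{P(s \mid \neg a)} \cdot \frac{P(a)}{1 - P(a)} \approx P(a) \cdot \frac{P(s \mid a)}{P(s \mid \neg a)} \quad \text{for } P(a) \ll 1.
$$

Under these assumptions the prior contributes only a constant factor, so sorting candidates by the likelihood ratio yields the same order as sorting them by posterior odds.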
Abstract
For centuries, writers have hidden messages in their texts as acrostics, where initial letters of consecutive lines or paragraphs form meaningful words or phrases. Scholars searching for acrostics manually can only focus on a few authors at a time and often favor qualitative arguments in discussing intentionality. We aim to put the study of acrostics on firmer statistical footing by presenting AcrosticSleuth, a first-of-its-kind tool that automatically identifies acrostics and ranks them by the probability that the sequence of characters does not occur by chance (and therefore may have been inserted intentionally). Acrostics are rare, so we formalize the problem as a binary classification task in the presence of extreme class imbalance. To evaluate AcrosticSleuth, we present the Acrostic Identification Dataset (AcrostID), a collection of acrostics from the WikiSource online database. Despite the class imbalance, AcrosticSleuth achieves F1 scores of 0.39, 0.59, and 0.66 on French, English, and Russian subdomains of WikiSource, respectively. We further demonstrate that AcrosticSleuth can identify previously unknown high-profile instances of wordplay, such as the acrostic spelling ARSPOETICA ("art of poetry") by Italian humanist Albertino Mussato and English philosopher Thomas Hobbes' signature in the opening paragraphs of The Elements of Law.
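To make the ranking idea concrete, the following is a minimal Python sketch, not AcrosticSleuth's actual implementation. It collects the initial letters of consecutive lines and scores each contiguous run by a log-likelihood ratio. All names and numbers here are hypothetical: `WORD_PROB` is a toy whole-word lookup standing in for the SentencePiece unigram language model, and `INITIAL_FREQ` is an invented table of first-letter frequencies standing in for the chance model.

```python
import math

# Hypothetical toy statistics, invented purely for illustration.
# Relative frequency of each letter appearing as the first letter of a line.
INITIAL_FREQ = {"a": 0.12, "r": 0.05, "s": 0.08, "t": 0.15, "x": 0.01}
DEFAULT_FREQ = 0.04  # fallback for letters missing from the toy table

# Hypothetical word probabilities standing in for a real language model.
WORD_PROB = {"ars": 1e-4, "art": 1e-3, "star": 5e-4}
UNKNOWN_WORD_PROB = 1e-12  # assigned to any string not in the toy lexicon


def initials(lines):
    """Return the lowercase first character of each non-empty line."""
    return "".join(line.strip()[0].lower() for line in lines if line.strip())


def log_likelihood_ratio(s):
    """Compute log P(s | a) - log P(s | not a) under the toy models above."""
    log_p_acrostic = math.log(WORD_PROB.get(s, UNKNOWN_WORD_PROB))
    log_p_chance = sum(math.log(INITIAL_FREQ.get(c, DEFAULT_FREQ)) for c in s)
    return log_p_acrostic - log_p_chance


def rank_candidates(lines, min_len=3, max_len=4):
    """Score every contiguous run of initials; highest ratio comes first."""
    letters = initials(lines)
    scored = []
    for start in range(len(letters)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(letters):
                s = letters[start:start + length]
                scored.append((log_likelihood_ratio(s), start, s))
    return sorted(scored, reverse=True)


poem = [
    "Soft light falls over the hill,",
    "Through branches bare and still.",
    "A river hums beneath the snow,",
    "Restless, patient, running slow.",
]
ranked = rank_candidates(poem)
```

Here `ranked[0]` holds the best-scoring candidate together with its starting line index. A real system would replace the whole-word lookup with a model that can segment multi-word sequences and would estimate letter frequencies from the corpus itself.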
