Historia Magistra Vitae: Dynamic Topic Modeling of Roman Literature using Neural Embeddings
Michael Ginn, Mans Hulden
TL;DR
This paper tackles the challenge of applying dynamic topic models to the expansive, noisy, and long-span corpus of surviving Roman literature. It benchmarks LDA, NMF, and BERTopic (embedding-based clustering) across the Latin Library corpus segmented into 10 time slices, evaluating with TC-Embed coherence and MPJ generality while also considering qualitative interpretability. The authors show that while LDA/NMF score higher on quantitative metrics, BERTopic yields more interpretable topic distributions and clearer historical trends with minimal hyperparameter tuning, suggesting practical viability for historians. They provide the first dynamic topic model over the entire Roman corpus, discuss limitations such as embedding requirements for low-resource languages, and release their code on GitHub.
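As a concrete illustration of the pipeline the TL;DR describes, here is a minimal sketch using the public BERTopic API (`fit_transform` and `topics_over_time`). The toy documents and time-slice labels are placeholders, not the authors' actual data handling; a real run needs the full Latin Library corpus rather than two lines of verse.

```python
from bertopic import BERTopic

# Placeholder corpus: in the paper's setting this would be the Latin Library
# corpus, with each document assigned to one of 10 time slices.
docs = [
    "arma virumque cano troiae qui primus ab oris",  # Vergil (Augustan era)
    "gallia est omnis divisa in partes tres",        # Caesar (late Republic)
    # ... a realistic run needs thousands of documents
]
timestamps = [1, 0]  # integer time-slice label per document

# Latin lacks a dedicated pretrained model, so a multilingual
# sentence-transformer is a plausible default embedding backend.
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)

# Re-estimate each topic's word representation within each time bin,
# yielding per-slice topic trajectories over the corpus's history.
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=10)
topic_model.visualize_topics_over_time(topics_over_time)
```

The "minimal hyperparameter tuning" advantage claimed for BERTopic is visible here: the defaults above are essentially the whole configuration, in contrast to the priors and topic-count sweeps that LDA and NMF typically require.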
Abstract
Dynamic topic models have been proposed as a tool for historical analysis, but traditional approaches have had limited usefulness, being difficult to configure, interpret, and evaluate. In this work, we experiment with a recent approach to dynamic topic modeling using BERT embeddings. We compare topic models built with traditional statistical methods (LDA and NMF) against the BERT-based model, modeling topics over the entire surviving corpus of Roman literature. We find that while quantitative metrics favor the statistical models, qualitative evaluation yields better insights from the neural model. Furthermore, the neural topic model is less sensitive to hyperparameter configuration and thus may make dynamic topic modeling more viable for historical researchers.
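On the evaluation side, a self-contained sketch of the generality metric is below. It assumes MPJ denotes mean pairwise Jaccard similarity over topics' top-word sets, a common way to quantify how generic or overlapping a model's topics are; the function name and toy Latin topics are illustrative, not taken from the paper.

```python
from itertools import combinations

def mean_pairwise_jaccard(topics):
    """Mean Jaccard similarity over all pairs of topics, where each topic
    is its set of top words. Lower values mean more distinct topics."""
    word_sets = [set(t) for t in topics]
    sims = [len(a & b) / len(a | b) for a, b in combinations(word_sets, 2)]
    return sum(sims) / len(sims)

# Toy example: three topics as top-5 word lists.
topics = [
    ["bellum", "miles", "castra", "hostis", "pugna"],
    ["deus", "templum", "sacer", "ara", "numen"],
    ["bellum", "dux", "castra", "legio", "miles"],
]
print(mean_pairwise_jaccard(topics))  # ~0.14: topics 1 and 3 share words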
