The study of short texts in digital politics: Document aggregation for topic modeling
Nitheesha Nakka, Omer F. Yalcin, Bruce A. Desmarais, Sarah Rajtmajer, Burt Monroe
TL;DR
Problem: Document length and aggregation level can shape topic-model outputs in political text analysis.
Approach: Analyze one million tweets from U.S. state legislators with Structural Topic Models, comparing documents aggregated per legislator against per-tweet documents, and replicate the comparison with Wikipedia pages aggregated by birth city. Evaluate 60- and 120-topic configurations and their predictive validity. A transformer-based BERTopic analysis appears in the appendix.
Key findings: Legislator-level aggregation yields more state-related topics and higher predictive accuracy for state membership than tweet-level models. In the 120-topic specification, AIC is 12,032.07 (Legislator) versus 23,489.63 (Tweets), where lower is better, and out-of-sample accuracy is 85.7% versus 59.2%.
Significance: Aggregation is a consequential preprocessing choice in political text analysis. The authors recommend comparing 2–3 aggregation levels for descriptive work, using cross-validation to choose the level for predictive tasks, and developing fit measures that can be compared across aggregation levels.
Abstract
Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches, and there is ongoing debate about how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. We analyze one million tweets posted by U.S. state legislators between April 2016 and September 2020. When tweets are aggregated at the account level, topics are more strongly associated with individual states than when each tweet is treated as a separate document. We replicate this finding with Wikipedia pages aggregated by birth city, showing how the definition of a document can shape topic modeling results.
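To make the aggregation step concrete, the following is a minimal sketch of building the two document definitions compared here: one document per tweet and one document per account. It is an illustration only, not the authors' pipeline. The function name `aggregate_by_unit`, the field names `account` and `text`, and the toy tweets are all hypothetical.

```python
from collections import defaultdict


def aggregate_by_unit(records, unit_key="account", text_key="text"):
    """Concatenate short texts that share a unit into one document per unit.

    `unit_key` and `text_key` are hypothetical field names chosen for this
    sketch; any natural unit that partitions the corpus could be used.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record[unit_key]].append(record[text_key])
    # Join each unit's texts in their original order.
    return {unit: " ".join(texts) for unit, texts in grouped.items()}


# Hypothetical toy data standing in for legislator tweets.
tweets = [
    {"account": "legislator_a", "text": "Proud to support the new transit bill."},
    {"account": "legislator_b", "text": "Town hall tonight on school funding."},
    {"account": "legislator_a", "text": "Transit funding passed the committee."},
]

# Tweet-level corpus: every tweet is its own document.
tweet_level_docs = [t["text"] for t in tweets]

# Account-level corpus: one longer document per legislator.
account_level_docs = aggregate_by_unit(tweets)
```

Either corpus can then be passed to the same topic model, so that any difference in the estimated topics is attributable to the document definition rather than to the underlying text.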
