Analyzing Political Text at Scale with Online Tensor LDA
Sara Kangaslahti, Danny Ebanks, Jean Kossaifi, Anqi Liu, R. Michael Alvarez, Animashree Anandkumar
TL;DR
The paper introduces TLDA, a scalable, theoretically grounded Tensor LDA approach that learns topic models on extremely large text corpora through online centering, batched processing, and streaming updates. By centering first and incrementally decomposing centered second- and third-order moments, TLDA achieves provable identifiability and favorable sample-complexity guarantees while running efficiently on GPUs. The authors provide an end-to-end open-source TensorLy-based implementation, demonstrate linear scaling to over a billion documents, and validate the method with large-scale applications to #MeToo and 2020 election discourse, including qualitative insights into movement dynamics. This work offers social scientists a practical tool for analyzing massive unstructured political text in near real-time and discovering theoretically meaningful topics at unprecedented scale.
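To make "online centering" and "batched processing" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation, and it does not use the TensorLy API: the function names `streaming_mean` and `centered_second_moment` are hypothetical, the document-term counts are synthetic, and the second-order statistic is the plain empirical covariance rather than the exact moment estimator used by TLDA.

```python
import numpy as np


def streaming_mean(batches, vocab_size):
    """Update a running mean of word counts one batch at a time."""
    count = 0
    mean = np.zeros(vocab_size)
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        n = batch.shape[0]
        count += n
        # Incremental update: equals (old_sum + batch_sum) / count.
        mean += (batch.sum(axis=0) - n * mean) / count
    return mean, count


def centered_second_moment(batches, mean):
    """Accumulate the centered second-order moment batch by batch."""
    vocab_size = mean.shape[0]
    m2 = np.zeros((vocab_size, vocab_size))
    count = 0
    for batch in batches:
        centered = np.asarray(batch, dtype=float) - mean
        m2 += centered.T @ centered
        count += centered.shape[0]
    return m2 / count


# Synthetic document-term counts (assumption: 1000 documents, 20 words).
rng = np.random.default_rng(0)
docs = rng.poisson(1.0, size=(1000, 20))
batches = np.array_split(docs, 10)

mean, n_docs = streaming_mean(batches, vocab_size=20)
m2 = centered_second_moment(batches, mean)
```

Only one batch is held in working memory at a time, so the cost of each step depends on the batch size and vocabulary size rather than on the total corpus size. The third-order moment and its decomposition are omitted from this sketch.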
Abstract
This paper proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, Tensor Latent Dirichlet Allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods), and that it scales linearly to text datasets with over a billion documents; iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new real-world, large-scale studies of interest to political scientists: we provide the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation, and a detailed study of social media conversations about election fraud in the 2020 presidential election. This method thus gives social scientists the ability to study very large corpora and to answer important, theoretically relevant questions about salient issues in near real-time.
