Analyzing Political Text at Scale with Online Tensor LDA

Sara Kangaslahti, Danny Ebanks, Jean Kossaifi, Anqi Liu, R. Michael Alvarez, Animashree Anandkumar

TL;DR

The paper introduces TLDA, a scalable, theoretically grounded Tensor LDA approach that learns topic models on extremely large text corpora through online centering, batched processing, and streaming updates. By centering first and incrementally decomposing centered second- and third-order moments, TLDA achieves provable identifiability and favorable sample-complexity guarantees while running efficiently on GPUs. The authors provide an end-to-end open-source TensorLy-based implementation, demonstrate linear scaling to over a billion documents, and validate the method with large-scale applications to #MeToo and 2020 election discourse, including qualitative insights into movement dynamics. This work offers social scientists a practical, real-time capable tool for analyzing massive unstructured political text and discovering theoretically meaningful topics at unprecedented scale.

Abstract

This paper proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, Tensor Latent Dirichlet Allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods), and that it scales linearly to text datasets with over a billion documents; iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new real-world, large-scale studies of interest to political scientists: the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation, and a detailed study of social media conversations about election fraud in the 2020 presidential election. This method thus gives social scientists the ability to study very large corpora at scale and to answer important, theoretically relevant questions about salient issues in near real-time.

Paper Structure

This paper contains 48 sections, 38 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Evolution of the most prominent pro- and counter-movement topics in the #MeToo discussion. In each iteration of the dynamic analysis described in Section \ref{ssec:qualitative}, we inspect the topics and manually label them, as well as classify them as pro- or counter-#MeToo. We then display the topic in each category with the highest weight $\alpha_i$ below.
  • Figure 2: Overview of our approach. As batches of documents arrive incrementally, they are first pre-processed (stemmed, tokenized, and mapped to a standardized vocabulary). We then build a dataset of the counts of each word in each document. Next, we compute the average number of times each word appears per document (the average word occurrence, which is the first moment $M_1$) and subtract $M_1$ from the existing word-frequency matrix. The resulting document-term matrix is our centered dataset, $X$ (Section \ref{ssec:m1}). We then perform a singular value decomposition on the centered data $X$ to recover the whitening weights without ever computing $M_2$ directly. This saves computational overhead while remaining mathematically equivalent. We use these whitening weights to transform the centered data $X$, which can be done incrementally (Section \ref{ssec:m2}). Finally, we construct the whitened equivalent of the third-order moment, $M_3$, which is updated directly in this factorized form (Section \ref{ssec:m3}). The learned factorization can be directly unwhitened and uncentered to recover the classic solution to TLDA (Section \ref{thm1}) and to recover the topics and their associated word probabilities (Section \ref{ssec:postprocess}).
  • Figure 3: Evolution of the most prominent political topics in the #MeToo discussion. In each iteration of the dynamic analysis detailed in Section \ref{ssec:qualitative}, we inspect the topics, manually label them, and classify them as political or not political. We display the political topic with the highest weight $\alpha_i$ below.
  • Figure 4: Tweets per month in the #MeToo data, in millions.
  • Figure 5: Runtime comparison for TLDA on GPU vs Gensim for the full #MeToo corpus and varying numbers of topics. The runtime of our method stays nearly constant as the number of topics grows, while Gensim's runtime grows more than linearly.
  • ...and 6 more figures
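
The centering and whitening steps described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' TensorLy implementation: the function names (`running_mean`, `whitening_from_centered`), the synthetic Poisson counts, and the batch sizes are all assumptions made for illustration. The sketch uses plain empirical moments, leaves out any LDA-specific correction terms, and stops before the tensor decomposition itself. It only shows that an SVD of the centered data $X$ yields whitening weights without forming $M_2$ explicitly.

```python
import numpy as np

def running_mean(batches):
    """Accumulate the first moment M1 (mean count per word) batch by batch."""
    total, n_docs = None, 0
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        col_sum = batch.sum(axis=0)
        total = col_sum if total is None else total + col_sum
        n_docs += batch.shape[0]
    return total / n_docs

def whitening_from_centered(X, n_topics):
    """Whitening weights from the SVD of centered data X (n_docs x vocab).

    With X = U S V^T, the empirical second moment X^T X / n equals
    V (S^2 / n) V^T, so its top eigenpairs come from the SVD of X
    and the vocab x vocab matrix is never built.
    """
    n_docs = X.shape[0]
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    v = vt[:n_topics].T                    # vocab x n_topics
    eigvals = s[:n_topics] ** 2 / n_docs   # top eigenvalues of X^T X / n
    return v / np.sqrt(eigvals)            # scale each column by 1/sqrt(eigenvalue)

# Synthetic document-term counts (hypothetical data, 200 documents, 12 words).
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 12)).astype(float)
batches = np.array_split(counts, 4)        # pretend the documents arrive in 4 batches

m1 = running_mean(batches)                 # first moment M1
X = counts - m1                            # centered data
W = whitening_from_centered(X, n_topics=3)
X_white = X @ W                            # whitened data, n_docs x n_topics

# The whitened data has identity second moment.
cov = X_white.T @ X_white / X.shape[0]

# Empirical whitened third-order moment (n_topics x n_topics x n_topics).
M3_white = np.einsum("ni,nj,nk->ijk", X_white, X_white, X_white) / X.shape[0]
```

In this sketch `cov` is the identity matrix up to floating-point error, which is what whitening is meant to achieve, and `M3_white` is a small $3 \times 3 \times 3$ tensor whose size depends only on the number of topics, not on the vocabulary size.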

Theorems & Definitions (1)

  • proof