Graph Topic Modeling for Documents with Spatial or Covariate Dependencies
Yeo Jin Jung, Claire Donnat
TL;DR
Graph-Aligned pLSI (GpLSI) extends the frequentist pLSI framework by incorporating a document similarity graph, enforcing smoothness in document-topic mixtures through a graph-based total-variation penalty, and replacing the standard SVD step with an iterative graph-aligned SVD for denoising. The method yields provable high-probability bounds on the estimation errors for the document-topic matrix $W$ and the word-topic matrix $A$, with rates that improve in well-connected graph topologies. Synthetic experiments show substantial gains in short-document regimes, and real-world studies on spatial transcriptomics and culinary datasets demonstrate improved topic coherence and interpretability. A data-driven cross-validation scheme based on minimum spanning trees selects the graph regularization parameter, making the approach scalable and practically applicable. Overall, GpLSI provides a fast, interpretable, and theoretically grounded framework for topic modeling when document-level covariates or similarities are available.
Abstract
We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and similarities between them as edges, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of our proposed method by deriving high-probability bounds and develop a specialized cross-validation method to optimize our regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.
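To make the smoothness idea concrete, the sketch below computes a graph total-variation quantity for a small document-topic matrix: the sum, over graph edges, of the distance between the topic mixtures of the two connected documents. It is an illustration only, not the authors' estimator. The function name `graph_total_variation`, the choice of the l1 distance between rows, and the toy path graph are all assumptions made for this example; the exact norm used in GpLSI may differ.

```python
# Illustrative sketch (assumptions: l1 distance per edge, toy data, hypothetical names).

def graph_total_variation(W, edges):
    """Sum over edges (i, j) of the l1 distance between rows W[i] and W[j].

    W     : list of per-document topic mixtures (each row sums to 1)
    edges : list of (i, j) index pairs from the document similarity graph
    """
    total = 0.0
    for i, j in edges:
        total += sum(abs(a - b) for a, b in zip(W[i], W[j]))
    return total


# Toy path graph on four documents: 0 - 1 - 2 - 3
edges = [(0, 1), (1, 2), (2, 3)]

# Mixtures that vary gradually along the graph
W_smooth = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]

# Mixtures that flip at every edge
W_rough = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]

tv_smooth = graph_total_variation(W_smooth, edges)
tv_rough = graph_total_variation(W_rough, edges)
```

Here `tv_smooth` is smaller than `tv_rough`, so a penalty proportional to this quantity would favor the first configuration, in which neighboring documents have similar topic proportions.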
