Graph Topic Modeling for Documents with Spatial or Covariate Dependencies

Yeo Jin Jung, Claire Donnat

TL;DR

Graph-Aligned pLSI (GpLSI) extends the frequentist pLSI framework by incorporating a document similarity graph, enforcing smoothness in document-topic mixtures through a graph-based total-variation penalty, and replacing the standard SVD step with an iterative graph-aligned SVD for denoising. The method yields provable high-probability bounds on the estimation errors for the document-topic matrix $W$ and the word-topic matrix $A$, with rates that improve in well-connected graph topologies. Synthetic experiments show substantial gains in short-document regimes, and real-world studies in spatial transcriptomics and culinary datasets demonstrate improved topic coherence and interpretability. A data-driven cross-validation scheme based on minimum spanning trees selects the graph regularization parameter, making the approach scalable and practically applicable. Overall, GpLSI provides a fast, interpretable, and theoretically grounded framework for topic modeling when document-level covariates or similarities are available.
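
To make the graph-based total-variation penalty concrete, the sketch below computes one common form of it: the sum, over edges $(i, j)$ of the document graph, of the Euclidean distance between rows $i$ and $j$ of $W$. This summary does not state the exact penalty or the matrix it is applied to in GpLSI, so the helper name `graph_tv_penalty` and the $\ell_{2,1}$ form are assumptions made for illustration.

```python
import numpy as np

def graph_tv_penalty(W, edges):
    """Sum over edges (i, j) of ||W[i] - W[j]||_2 (an l_{2,1} total-variation form).

    W     : (n, K) array, one topic-mixture row per document.
    edges : iterable of (i, j) index pairs from the document graph.
    Hypothetical helper; the penalty used in the paper may differ.
    """
    W = np.asarray(W, dtype=float)
    edges = np.asarray(list(edges), dtype=int).reshape(-1, 2)
    diffs = W[edges[:, 0]] - W[edges[:, 1]]      # row differences across each edge
    return float(np.linalg.norm(diffs, axis=1).sum())

# Three documents on a path graph 0 - 1 - 2.
edges = [(0, 1), (1, 2)]
W_smooth = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])   # identical mixtures
W_rough = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])    # mixtures flip at every edge

penalty_smooth = graph_tv_penalty(W_smooth, edges)   # no variation across edges
penalty_rough = graph_tv_penalty(W_rough, edges)     # large variation across edges
```

A penalty of this kind is zero when neighboring documents share the same mixture and grows with every edge across which the mixtures differ, which is what pushes the estimate toward graph-smooth solutions.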

Abstract

We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and edges denoting similarities, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of our proposed method by deriving high-probability bounds and develop a specialized cross-validation method to optimize our regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.
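
The cross-validation scheme is described here only as being based on minimum spanning trees. One plausible reading, assumed in the sketch below and not confirmed by this summary, is to build a minimum spanning tree of the document graph, pick a source document, and assign each document to a fold by its hop distance from the source modulo the number of folds. The function `mst_folds` and its parameters are hypothetical.

```python
import heapq
import random
from collections import deque

def mst_folds(n, weighted_edges, n_folds=2, source=None, seed=0):
    """Assign nodes of a connected graph to folds by depth in a minimum spanning tree.

    weighted_edges : iterable of (i, j, weight) for an undirected graph on nodes 0..n-1.
    Returns a list `fold` with fold[i] = (tree distance from source to i) mod n_folds.
    Hypothetical helper; assumes the graph is connected.
    """
    adj = {i: [] for i in range(n)}
    for i, j, w in weighted_edges:
        adj[i].append((w, j))
        adj[j].append((w, i))

    if source is None:
        source = random.Random(seed).randrange(n)

    # Prim's algorithm, grown from the source node.
    tree = {i: [] for i in range(n)}
    visited = {source}
    heap = [(w, source, j) for w, j in adj[source]]
    heapq.heapify(heap)
    while heap and len(visited) < n:
        w, i, j = heapq.heappop(heap)
        if j in visited:
            continue
        visited.add(j)
        tree[i].append(j)
        tree[j].append(i)
        for w2, k in adj[j]:
            if k not in visited:
                heapq.heappush(heap, (w2, j, k))

    # Breadth-first search on the tree gives each node's hop distance from the source.
    depth = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in tree[i]:
            if j not in depth:
                depth[j] = depth[i] + 1
                queue.append(j)

    return [depth[i] % n_folds for i in range(n)]

# A 4-cycle with one heavy edge: the MST drops the edge (3, 0) and becomes the path 0-1-2-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 5.0)]
folds = mst_folds(4, edges, n_folds=2, source=0)   # [0, 1, 0, 1]
```

With two folds, every tree neighbor of a held-out document lies in the other fold, so each held-out document keeps at least one observed neighbor from which its mixture could be interpolated.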

Paper Structure

This paper contains 52 sections, 23 theorems, 236 equations, 21 figures, 3 algorithms.

Key Result

Theorem 1

Suppose $\max(K,p) \leq n$ and $\sqrt{K} \leq p$. Under Assumptions assumption:smoothness to assumption:h_j, the eigenvectors of the matrix ${X^{\top}X} -\frac{n}{N} \hat{D}_0$ provide a reasonable approximation to the right singular vectors, in that with probability at least $1-o(n^{-1})$: for some constants $C$ and $C^*>0$.
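
As an illustration of the matrix in Theorem 1, the sketch below forms $X^{\top}X - \frac{n}{N}\hat{D}_0$ on simulated data and extracts its top $K$ eigenvectors. This summary does not define $\hat{D}_0$; the code assumes it is the diagonal matrix of column means of $X$, which is the correction that arises under multinomial sampling noise with a common document length $N$. The simulation setup and the helper name `corrected_gram_eigvecs` are likewise assumptions.

```python
import numpy as np

def corrected_gram_eigvecs(X, N, K):
    """Top-K eigenvectors of X^T X - (n / N) * D0_hat.

    X : (n, p) matrix of word frequencies (rows sum to 1).
    N : document length (assumed equal for all documents).
    ASSUMPTION: D0_hat = diag(column means of X); the summary does not define it.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    D0_hat = np.diag(X.mean(axis=0))
    G = X.T @ X - (n / N) * D0_hat
    eigvals, eigvecs = np.linalg.eigh(G)          # ascending eigenvalues, orthonormal vectors
    top = np.argsort(eigvals)[::-1][:K]           # indices of the K largest eigenvalues
    return eigvecs[:, top]

rng = np.random.default_rng(0)
n, p, K, N = 200, 15, 3, 50
W = rng.dirichlet(np.ones(K), size=n)             # document-topic mixtures, rows sum to 1
A = rng.dirichlet(np.ones(p), size=K)             # topic-word distributions, rows sum to 1
M = W @ A                                         # expected word frequencies
X = np.vstack([rng.multinomial(N, m) for m in M]) / N   # observed frequencies

V_hat = corrected_gram_eigvecs(X, N, K)           # (p, K) orthonormal basis estimate
```

Subtracting the diagonal term removes the part of $X^{\top}X$ that comes purely from sampling noise, which matters most when documents are short (small $N$).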

Figures (21)

  • Figure 1: $\ell_2$ error for the estimator $\widehat{W}$ (defined as $\min_{P \in \mathcal{P}}\frac{1}{n}\| \widehat{W} - WP\|_{F}$) for different combinations of document length $N$ and vocabulary size $p$. Here, $n=1000$ and $K=3$.
  • Figure 2: $\ell_2$ error of $W$ (left) and $A$ (middle) and computation time (right) for different corpus sizes $n$ and numbers of topics $K$. Here, $N=30$ and $p=30$. Errors are normalized by $n$.
  • Figure 3: (A) Estimated tumor-immune topic weights of GpLSI, pLSI, and LDA. Topic weights are aligned across methods using cosine similarity. (B) Topic alignment paths of GpLSI, pLSI, and LDA using R package alto. (C) Pairwise $\ell_1$ distance and cosine similarity of topic weights from different batches of patients.
  • Figure 4: (A) AUC for predicting cancer recurrence using isometric log-ratio transformed topic proportions (top) and dichotomized topic proportions (bottom) as covariates. (B) Kaplan-Meier curves based on dichotomized topic proportions using GpLSI.
  • Figure 5: (A) Visualization of estimated B cell microenvironment topics for $K=3,5,7,10$. (B) Comparison of clustering performance using Moran's I and PAS score. We plot 1-PAS for better interpretation. (C) Estimated B cell microenvironment topic weights for $K=5$ using GpLSI.
  • ...and 16 more figures
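
The error reported in Figure 1 is minimized over a set $\mathcal{P}$ of matrices $P$, because topics are identifiable only up to relabeling. A minimal sketch of that metric, assuming $\mathcal{P}$ is the set of $K \times K$ permutation matrices acting on the columns of $W$, and using brute force over all $K!$ orderings (reasonable only for small $K$, such as $K=3$):

```python
import itertools
import numpy as np

def aligned_error(W_hat, W):
    """min over column permutations P of (1/n) * ||W_hat - W P||_F.

    Brute force over all K! permutations, so only suitable for small K.
    """
    W_hat = np.asarray(W_hat, dtype=float)
    W = np.asarray(W, dtype=float)
    n, K = W.shape
    best = np.inf
    for perm in itertools.permutations(range(K)):
        err = np.linalg.norm(W_hat - W[:, list(perm)], ord="fro") / n
        best = min(best, err)
    return best

W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
W_hat = W[:, [2, 0, 1]]          # same mixtures, topics listed in a different order

error = aligned_error(W_hat, W)  # 0.0: the relabeling is undone by the minimizing permutation
```

For larger $K$, the squared Frobenius norm splits into a sum over matched column pairs, so the same minimum can be found with a linear assignment solver instead of enumerating permutations.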

Theorems & Definitions (47)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Theorem 1
  • Theorem 2
  • Remark 5
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • ...and 37 more