Tensor Topic Modeling Via HOSVD

Yating Liu; Claire Donnat

TL;DR

Tensor Topic Modeling Via HOSVD introduces a tensor-structured extension of topic modeling by modeling the expected word-frequency tensor $\mathcal{D}$ with a nonnegative Tucker decomposition $\mathcal{D}=\mathcal{G}\times(\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)})$, capturing reviewer, document, and word dimensions simultaneously. The authors develop an HOSVD-based estimation procedure under anchor-word/reviewer/paper assumptions, provide explicit entrywise $\ell_1$ convergence guarantees, and extend the method to accommodate noise and weak $\ell_q$ sparsity. They propose a practical pipeline to recover the factor matrices and core tensor, including SCORE normalization for the word mode and a core-recovery step that yields interpretable cross-mode interactions. Through synthetic experiments and real data (ArXiv abstracts, vaginal microbiome dynamics, and market-basket transactions), the method demonstrates superior reconstruction accuracy, stability, and richer interpretability of multi-way topic interactions compared to baselines like Tensor-LDA, STM, and NTD. The work highlights the benefits of leveraging tensor structure to analyze multi-dimensional count data, with potential privacy advantages via vocabulary thresholding and broad applicability across text, biology, and consumer analytics.
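
For readers less familiar with Tucker notation, the decomposition named in this summary can be written entry by entry. The index names $i, j, r$ (one per mode) and $a, b, c$ (one per latent factor), as well as the rank symbols $K^{(1)}$ and $K^{(2)}$, are assumptions chosen by analogy with the $K^{(3)}$ used in the paper's Lemma 2.1; the paper's own indexing may differ.

```latex
% Assumed index names: i, j, r index the three observed modes;
% a, b, c index the latent factors of modes 1, 2, 3 respectively.
\mathcal{D}_{ijr}
  = \sum_{a=1}^{K^{(1)}} \sum_{b=1}^{K^{(2)}} \sum_{c=1}^{K^{(3)}}
    \mathcal{G}_{abc}\,
    \mathbf{A}^{(1)}_{ia}\,\mathbf{A}^{(2)}_{jb}\,\mathbf{A}^{(3)}_{rc}
```

Read this way, each entry of the core tensor $\mathcal{G}$ weights one combination of latent factors across the three modes, which is why the core is described as capturing cross-mode interactions.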

Abstract

By representing documents as mixtures of topics, topic modeling has allowed the successful analysis of datasets across a wide spectrum of applications ranging from ecology to genetics. An important body of recent work has demonstrated the computational and statistical efficiency of probabilistic Latent Semantic Indexing (pLSI), a type of topic modeling, in estimating both the topic matrix (corresponding to distributions over word frequencies) and the topic assignment matrix. However, these methods are not easily extendable to the incorporation of additional temporal, spatial, or document-specific information, thereby potentially neglecting useful information in the analysis of spatial or longitudinal datasets that can be represented as tensors. Consequently, in this paper, we propose using a modified higher-order singular value decomposition (HOSVD) to estimate topic models based on a Tucker decomposition, thus accommodating the complexity of tensor data. Our method exploits the strength of tensor decomposition in reducing data to lower-dimensional spaces and successfully recovers lower-rank topic and cluster structures, as well as a core tensor that highlights interactions among latent factors. We further characterize explicitly the convergence rate of our method in entry-wise $\ell_1$ norm. Experiments on synthetic data demonstrate the statistical efficiency of our method and its ability to better capture patterns across multiple dimensions. Our approach also performs well when applied to large datasets of research abstracts and in the analysis of vaginal microbiome data.
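
As a rough illustration of the building block the abstract refers to, the following is a minimal sketch of a plain truncated HOSVD in NumPy. It is not the authors' modified procedure: it omits the anchor assumptions, SCORE normalization, vertex hunting, and any nonnegativity constraint, so the recovered factors are orthonormal bases rather than topic or cluster matrices. All function names (`unfold`, `mode_product`, `truncated_hosvd`, `tucker_reconstruct`) and the toy tensor sizes are hypothetical choices made for this example.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` of shape (new_dim, old_dim)."""
    moved = np.moveaxis(tensor, mode, -1)   # bring the target mode to the end
    result = moved @ matrix.T               # contract over old_dim
    return np.moveaxis(result, -1, mode)    # put the new axis back in place


def truncated_hosvd(tensor, ranks):
    """Plain truncated HOSVD (illustrative; not the paper's modified version)."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto each subspace
    return core, factors


def tucker_reconstruct(core, factors):
    """Rebuild the tensor from a core and one factor matrix per mode."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Toy data (assumed sizes): an exact Tucker tensor with ranks (2, 2, 3).
rng = np.random.default_rng(0)
true_core = rng.random((2, 2, 3))
true_factors = [rng.random((6, 2)), rng.random((5, 2)), rng.random((7, 3))]
D = tucker_reconstruct(true_core, true_factors)

core, factors = truncated_hosvd(D, ranks=(2, 2, 3))
D_hat = tucker_reconstruct(core, factors)
```

In this noiseless toy case the truncated HOSVD reproduces `D` up to floating-point error because each unfolding has rank no larger than the requested rank. With noisy count data the reconstruction is only approximate, and turning the orthonormal bases into interpretable nonnegative factors requires the additional steps the paper describes.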

Paper Structure (65 sections, 39 theorems, 208 equations, 36 figures, 6 tables)

Introduction
Our contributions
Notations
Tensor Topic Modeling
Oracle procedure for the estimation of the factor matrices $\mathbf{A}^{(1)}$, $\mathbf{A}^{(2)}$, $\mathbf{A}^{(3)}$
Estimating $\mathbf{A}^{(1)}$ and $\mathbf{A}^{(2)}$
Estimating $\mathbf{A}^{(3)}$
Estimating the Latent Factors
Oracle Procedure:
Estimation procedure in the presence of noise
Estimating the core tensor $\mathcal{G}$
Theoretical Results
Experiments on synthetic data
Data Generation Procedure.
Benchmarks
...and 50 more sections

Key Result

Lemma 2.1

Suppose $\sigma_{K^{(3)}}(\mathbf{A}^{(3)})\geqslant c^*$ for some constant $c^*>0$. Then, there exists a positive vector $\mathbf{q}_0\in\mathbb{R}^{K^{(3)}}$ such that $\tilde{\mathbf{V}}^{(3)}=\text{diag}(\mathbf{q}_0) \mathbf{V}^{*(3)}$, where $\mathbf{V}^{*(3)}$ is the set of vertices extracted by the vertex hunting procedure.

Figures (36)

Figure 1: Tucker components $\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)},\mathcal{G}$ (Left to Right respectively) estimated by our method with ranks $(2,2,3)$ for the tensor from Figure \ref{fig:setting1_D_mixed}. Note that the factor $\mathbf{A}^{(1)}$ shows a clear clustering pattern. The estimated factor $\mathbf{A}^{(2)}$ also exhibits two clusters. Factor $\mathbf{A}^{(3)}$ shows the topics. The core tensor $\mathcal{G}$ highlights interactions between the clusters of multiple modes. Each slice in the last subfigure represents the clusters derived from $\mathbf{A}^{(2)}$. The reconstruction error is $\|\mathcal{\hat{D}}-\mathcal{D}\|_1=22.687$.
Figure 2: Tucker components $\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)},\mathcal{G}$ (Left to Right respectively) estimated by Nonnegative Tucker decomposition (NTD) with ranks $(2,2,3)$. The reconstruction error is $\|\mathcal{\hat{D}}-\mathcal{D}\|_1=91.543$.
Figure 3: Tucker components $\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)},\mathcal{G}$ (Left to Right respectively) estimated by Tensor LDA with ranks $(2,2,3)$. The reconstruction error is $\|\mathcal{\hat{D}}-\mathcal{D}\|_1=27.262$.
Figure 4: Tucker components $\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)},\mathcal{G}$ (Left to Right respectively) estimated by Hybrid-LDA with ranks $(2,2,3)$. The reconstruction error is $\|\mathcal{\hat{D}}-\mathcal{D}\|_1=82.703$.
Figure 5: Boxplot of the reconstruction errors over 30 runs on the same dataset from Figure \ref{fig:setting1_D_mixed}. The middle line is the median value over 30 runs. The red point refers to the mean value.
...and 31 more figures
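
The captions above report reconstruction errors of the form $\|\mathcal{\hat{D}}-\mathcal{D}\|_1$. A minimal sketch of how such a quantity could be computed is given below, assuming the norm is the entry-wise $\ell_1$ norm mentioned in the abstract (the sum of absolute differences over all tensor entries). The function name `entrywise_l1_error` and the toy arrays are hypothetical and do not reproduce any figure in the paper.

```python
import numpy as np


def entrywise_l1_error(estimate, target):
    """Sum of absolute entry-wise differences between two same-shaped tensors.

    Assumption: this matches the entry-wise l1 norm used in the captions.
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    if estimate.shape != target.shape:
        raise ValueError("estimate and target must have the same shape")
    return float(np.abs(estimate - target).sum())


# Toy example: a 2 x 2 x 2 tensor where every entry is off by 0.5.
target = np.zeros((2, 2, 2))
estimate = np.full((2, 2, 2), 0.5)
error = entrywise_l1_error(estimate, target)  # 8 entries * 0.5 = 4.0
```

Because this norm sums over all entries, its magnitude grows with tensor size, so values are comparable only between methods evaluated on the same tensor, as in Figures 1 to 4.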

Theorems & Definitions (82)

Definition 1.1: Anchor document assumption
Definition 1.2: Anchor word assumption
Remark 1.1
Definition 2.1: Ideal Simplex
Definition 2.2: Vertex hunting
Definition 2.3
Remark 2.1
Remark 2.2
Lemma 2.1
Remark 2.3: Comparison with HOOI
...and 72 more
