Non-negative matrix factorization algorithms generally improve topic model fits

Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens

TL;DR

The paper addresses efficient maximum-likelihood estimation for count-based topic representations by exploiting the equivalence between Poisson NMF and a multinomial topic framework. It formalizes a PNMF-to-MTM mapping and shows that fast Poisson-NMF optimization, particularly coordinate descent with extrapolation, can outperform traditional EM-based fitting while yielding better parameter estimates. The authors implement these methods in the fastTopics R package and demonstrate substantial speedups and improved fits on both text and single-cell datasets. They conclude that fitting Poisson NMF and then recovering the topic representation provides a practical, scalable approach for large-scale count-based topic analyses.

Abstract

In an effort to develop topic modeling methods that can be quickly applied to large data sets, we revisit the problem of maximum-likelihood estimation in topic models. It is known, at least informally, that maximum-likelihood estimation in topic models is closely related to non-negative matrix factorization (NMF). Yet, to our knowledge, this relationship has not been exploited previously to fit topic models. We show that recent advances in NMF optimization methods can be leveraged to fit topic models very efficiently, often resulting in much better fits and in less time than existing algorithms for topic models. We also formally make the connection between the NMF optimization problem and maximum-likelihood estimation for the topic model, and using this result we show that the expectation maximization (EM) algorithm for the topic model is essentially the same as the classic multiplicative updates for NMF (the only difference being that the operations are performed in a different order). Our methods are implemented in the R package fastTopics.
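As a rough illustration of the "classic multiplicative updates for NMF" mentioned in the abstract, the sketch below applies Lee-Seung style multiplicative updates to the Poisson (Kullback-Leibler) objective on a small random count matrix. It is a minimal sketch under stated assumptions, not the fastTopics implementation: the function names (`reconstruct`, `poisson_loglik`, `multiplicative_update`) and the toy data are hypothetical, and the only convention borrowed from the text is that ${\bf H}$ is $n \times K$ and ${\bf W}$ is $m \times K$, so that the Poisson rates are the entries of ${\bf H}{\bf W}^T$.

```python
import math
import random

def reconstruct(H, W):
    """Return the n x m matrix of rates, lambda_ij = sum_k H[i][k] * W[j][k]."""
    return [[sum(h * w for h, w in zip(Hi, Wj)) for Wj in W] for Hi in H]

def poisson_loglik(X, H, W):
    """Poisson NMF log-likelihood of the counts X given the factors H and W."""
    Lam = reconstruct(H, W)
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1)
               for Xi, Li in zip(X, Lam) for x, lam in zip(Xi, Li))

def multiplicative_update(X, H, W):
    """One pass of the multiplicative updates: first H, then W."""
    n, m, K = len(H), len(W), len(H[0])
    Lam = reconstruct(H, W)
    col_w = [sum(W[j][k] for j in range(m)) for k in range(K)]
    H = [[H[i][k] * sum(X[i][j] / Lam[i][j] * W[j][k] for j in range(m)) / col_w[k]
          for k in range(K)] for i in range(n)]
    Lam = reconstruct(H, W)  # recompute the rates with the updated H
    col_h = [sum(H[i][k] for i in range(n)) for k in range(K)]
    W = [[W[j][k] * sum(X[i][j] / Lam[i][j] * H[i][k] for i in range(n)) / col_h[k]
          for k in range(K)] for j in range(m)]
    return H, W

# Toy data (assumed for illustration): strictly positive counts keep every
# rate positive, so no division by zero or log(0) can occur.
random.seed(1)
n, m, K = 6, 5, 2
X = [[random.randint(1, 9) for _ in range(m)] for _ in range(n)]
H = [[random.uniform(0.5, 1.5) for _ in range(K)] for _ in range(n)]
W = [[random.uniform(0.5, 1.5) for _ in range(K)] for _ in range(m)]

logliks = [poisson_loglik(X, H, W)]
for _ in range(50):
    H, W = multiplicative_update(X, H, W)
    logliks.append(poisson_loglik(X, H, W))

print("log-likelihood before and after:", logliks[0], logliks[-1])
```

Each update rescales the current factor elementwise, so non-negativity is preserved without any projection step, and the log-likelihood does not decrease from one pass to the next. Convergence can be slow, which is consistent with the abstract's motivation for turning to faster NMF optimization methods.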

Paper Structure

This paper contains 25 sections, 3 theorems, 50 equations, 12 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Denote the Poisson NMF likelihood by $p_{\mathrm{PNMF}}({\bf X} \mid {\bf H}, {\bf W})$ and denote the multinomial topic model likelihood by $p_{\mathrm{MTM}}({\bf X} \mid {\bf L}, {\bf F})$. Assume ${\bf H} \in {\bf R}_{+}^{n \times K}$ and ${\bf W} \in {\bf R}_{+}^{m \times K}$, define $t_i$ [...] where $\mathrm{Pois}(x; \lambda)$ denotes the probability mass function of the Poisson distribution.
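A small numerical sketch of the standard fact behind this kind of equivalence: independent Poisson counts in a row factor exactly into a multinomial distribution given the row total, times a Poisson distribution for the total. The rescaling in `to_topic_model`, the names `s` and `u`, and the toy matrices are assumptions made for illustration; they may differ in detail from the paper's Definition 1 and are not taken from fastTopics.

```python
import math

def pois_logpmf(x, lam):
    """Log of the Poisson probability mass function with rate lam at x."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def multinom_logpmf(x, p):
    """Log of the multinomial pmf for counts x with probabilities p."""
    t = sum(x)
    return (math.lgamma(t + 1) - sum(math.lgamma(v + 1) for v in x)
            + sum(v * math.log(q) for v, q in zip(x, p)))

def to_topic_model(H, W):
    """Hypothetical rescaling of Poisson NMF factors (H, W) into (L, F, s, u)."""
    K = len(H[0])
    u = [sum(Wj[k] for Wj in W) for k in range(K)]        # column sums of W
    F = [[Wj[k] / u[k] for k in range(K)] for Wj in W]     # each column of F sums to 1
    s = [sum(Hi[k] * u[k] for k in range(K)) for Hi in H]  # expected row totals
    L = [[Hi[k] * u[k] / si for k in range(K)]             # each row of L sums to 1
         for Hi, si in zip(H, s)]
    return L, F, s, u

def loglik_pnmf(X, H, W):
    """Poisson NMF log-likelihood with rates lambda_ij = sum_k H[i][k] * W[j][k]."""
    return sum(pois_logpmf(x, sum(h * w for h, w in zip(Hi, Wj)))
               for Xi, Hi in zip(X, H) for x, Wj in zip(Xi, W))

def loglik_mtm(X, L, F):
    """Multinomial topic model log-likelihood with pi_ij = sum_k L[i][k] * F[j][k]."""
    return sum(multinom_logpmf(Xi, [sum(l * f for l, f in zip(Li, Fj)) for Fj in F])
               for Xi, Li in zip(X, L))

# Toy example (assumed): n = 2 rows, m = 3 columns, K = 2 topics.
X = [[3, 0, 2], [1, 4, 0]]
H = [[0.5, 1.0], [2.0, 0.3]]
W = [[1.0, 0.2], [0.4, 1.5], [0.7, 0.9]]

L, F, s, u = to_topic_model(H, W)
lhs = loglik_pnmf(X, H, W)
rhs = loglik_mtm(X, L, F) + sum(pois_logpmf(sum(Xi), si) for Xi, si in zip(X, s))
print("difference between the two sides:", lhs - rhs)
```

Under this rescaling the extra Poisson term depends on the factors only through the row "sizes" `s`, which is one way to see why a Poisson NMF fit can be converted into a multinomial topic model fit.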

Figures (12)

  • Figure 1: Results of fitting multinomial topic models to the MCF-7 data set [mcf7] with $K = 3$. Plots A--C show estimates of the $41 \times 3$ matrix ${\bf L}$: the initial estimates (obtained by running 4 EM updates); the MLE (obtained by running many CD updates, starting from the initial estimates); and the estimates obtained by running 200 EM updates starting from the initial estimates. Each estimate of ${\bf L}$ is visualized using a "Structure plot" [rosenberg-2002], which is a stacked bar chart in which the bar heights are given by the elements of ${\bf L}$. Plots D and E show the improvement in the multinomial topic model fits over time. Multinomial topic model log-likelihoods are shown relative to the log-likelihood of the multinomial topic model at the MLE (B); points highest on the y-axis indicate the worst log-likelihoods.
  • Figure 2: Selected results on fitting topic models using Poisson NMF algorithms. In A1--F1, multinomial topic model log-likelihoods are given relative to the best log-likelihood obtained among the four algorithms compared (EM and CD, with and without extrapolation). Log-likelihood differences less than 0.01 are shown as 0.01, and circles are drawn at intervals of 100 iterations. The 1,000 EM iterations performed during the initialization phase are not shown. Plots A2--F2 compare the final estimates of ${\bf L}$ from each of A1--F1. See also Figures \ref{fig:loglik-nips}--\ref{fig:kkt-pbmc68k} in the Appendix for additional results obtained with different settings of $K$.
  • Figure 3: Estimates of ${\bf L}$ from the newsgroups data with $K = 10$ obtained by running the EM updates without extrapolation (top) and the CD updates with extrapolation (bottom). The estimates of ${\bf L}$ are visualized using Structure plots. The documents are arranged by newsgroup to show the correspondence between the newsgroups and the topics. Note that the ordering of the documents within each newsgroup is not exactly the same in the top and bottom plots. See E1 and E2 in Fig. 2 for related results.
  • Figure 4: Estimates of ${\bf L}$ from the 68k PBMC data with $K = 7$ obtained by running the EM updates without extrapolation (top) and the CD updates with extrapolation (bottom). The estimates of ${\bf L}$ are visualized using Structure plots. To facilitate comparison, the cells were split into 5 groups based on the CD estimates of ${\bf L}$; these groups roughly correspond to cell types (B cells, T cells, etc.). The "T cells" group was downsampled to better visualize the other groups. Note that the ordering of the cells within each grouping is not exactly the same in the top and bottom plots. See F1 and F2 in Fig. 2 for related results.
  • Figure 5: Improvement in model fit over time for the different Poisson NMF algorithms applied to the NeurIPS data. Multinomial topic model log-likelihoods are shown relative to the best log-likelihood recovered among the four algorithms compared (EM and CD, with and without extrapolation). Log-likelihood differences less than 0.01 are shown as 0.01. Circles are drawn at intervals of 100 iterations. Note that the 1,000 EM iterations performed during the initialization phase are not shown.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Definition 1: Poisson NMF to multinomial topic model reparameterization
  • Lemma 1: Equivalence of Poisson NMF and multinomial topic model likelihoods
  • proof
  • Corollary 1: Relationship between MLEs for Poisson NMF and multinomial topic model
  • proof
  • Remark 1
  • Corollary 2: Relationship between MAP estimates for Poisson NMF and the multinomial topic model
  • proof