Constrained Non-negative Matrix Factorization for Guided Topic Modeling of Minority Topics

Seyedeh Fatemeh Ebrahimi, Jaakko Peltonen

TL;DR

The paper tackles minority-topic discovery in topic modeling by introducing a constrained non-negative matrix factorization framework that uses a seed-word list and soft prevalence bounds. It formulates a generalized KL divergence objective $D_{KL}(V \parallel WH)$ under two sets of inequality constraints on $W$ and $H$ and derives KKT-based multiplicative updates to optimize the model. Empirical results on synthetic data and a real-world YouTube mental-health case study show improved topic purity, higher NMI, and lower Jensen-Shannon divergence compared to baselines, demonstrating effective minority-content recovery without rigid supervision. The approach offers a scalable, flexible mechanism for extracting domain-relevant but low-prevalence themes in imbalanced corpora, with potential extensions to neural-contextual topic models and broader domains.
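For orientation, the generalized KL objective $D_{KL}(V \parallel WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$ and the classical multiplicative updates for it can be sketched as below. This is the standard unconstrained Lee-Seung scheme, not the paper's constrained KKT-derived updates: the seed-word and prevalence constraints are omitted, and the function names, `rank`, `n_iter`, and `eps` are illustrative assumptions.

```python
import numpy as np

def generalized_kl(V, W, H, eps=1e-12):
    """Generalized KL divergence D(V || WH) for non-negative matrices."""
    WH = W @ H + eps
    # Entries with V == 0 contribute only the +WH term (0 * log 0 := 0).
    log_term = np.where(V > 0, V * np.log((V + eps) / WH), 0.0)
    return float(np.sum(log_term - V + WH))

def kl_nmf(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Unconstrained KL-NMF via multiplicative updates (assumed baseline,
    without the paper's seed-word or prevalence constraints)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    history = []
    for _ in range(n_iter):
        # Update H: H <- H * (W^T (V / WH)) / (column sums of W)
        R = V / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        # Update W: W <- W * ((V / WH) H^T) / (row sums of H)
        R = V / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
        history.append(generalized_kl(V, W, H))
    return W, H, history
```

Because every update multiplies by a non-negative ratio, $W$ and $H$ stay non-negative without any projection step. A constrained variant would have to modify these ratios so that the inequality constraints also hold.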

Abstract

Topic models often fail to capture low-prevalence, domain-critical themes, so-called minority topics, such as mental health themes in online comments. While some existing methods can incorporate domain knowledge, such as expected topical content, methods allowing guidance may require overly detailed expected topics, hindering the discovery of topic divisions and variation. We propose a topic modeling solution via a specially constrained NMF. We incorporate a seed word list characterizing minority content of interest, but we do not require experts to pre-specify their division across minority topics. Through prevalence constraints on minority topics and seed word content across topics, we learn distinct data-driven minority topics as well as majority topics. The constrained NMF is fitted via Karush-Kuhn-Tucker (KKT) conditions with multiplicative updates. We outperform several baselines on synthetic data in terms of topic purity and normalized mutual information (NMI), and we also evaluate topic quality using Jensen-Shannon divergence (JSD). We conduct a case study on YouTube vlog comments, analyzing viewer discussion of mental health content; our model successfully identifies and reveals this domain-relevant minority content.
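The three evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the paper's evaluation code: the NMI here is normalized by the arithmetic mean of the two label entropies (one of several common conventions), the JSD uses base-2 logarithms so that it lies in $[0, 1]$, and all function names are hypothetical. How the paper pairs topics before computing JSD is not stated in this section.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    C = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(C, (t, p), 1)
    return C

def purity(labels_true, labels_pred):
    """Fraction of items that fall in the majority class of their cluster."""
    C = contingency(labels_true, labels_pred)
    return float(C.max(axis=0).sum() / C.sum())

def _entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies
    (normalization choice is an assumption)."""
    P = contingency(labels_true, labels_pred)
    P = P / P.sum()
    pt, pp = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = float(np.sum(P[nz] * np.log(P[nz] / np.outer(pt, pp)[nz])))
    denom = 0.5 * (_entropy(pt) + _entropy(pp))
    return mi / denom if denom > 0 else 1.0

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Purity and NMI compare a hard assignment of documents to topics against ground-truth labels, which is why they are reported on synthetic data. JSD compares two word distributions directly, so it can be applied to individual topics.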

Paper Structure

This paper contains 34 sections, 62 equations, 9 figures, 6 tables, 3 algorithms.

Figures (9)

  • Figure 1: Comparison of NMI and Purity Scores across Baselines on synthetic dataset (20 topics, 7 mental health topics, 500 samples). Left: Result table, the best is in bold. Right: results as a bar graph.
  • Figure 2: Topic Quality using JSD Score
  • Figure 3: KL Divergence Across Iterations with Error Bars.
  • Figure 4: Topic Quality using JSD Score
  • Figure 5: Effect of $W_{\text{max}}$ and $\theta_{\text{min}}$ on NMI and purity scores.
  • ...and 4 more figures