
PRISM: PRIor from corpus Statistics for topic Modeling

Tal Ishon, Yoav Goldberg, Uri Shaham

Abstract

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce \textbf{PRISM}, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.

Figures (9)

  • Figure 1: PRISM overview. From corpus $\mathcal{D}$, we build a second-order word-similarity graph (PPMI + cosine), obtain diffusion-map embeddings $v_i$, softly cluster them into $K$ topics, and estimate a data-driven topic--word Dirichlet prior $\hat{\boldsymbol{\beta}}$ for LDA. The lower panel shows the standard LDA graphical model; PRISM replaces the symmetric $\boldsymbol{\beta}$ with $\hat{\boldsymbol{\beta}}$ while leaving the generative process unchanged. (A minimal code sketch of this pipeline appears after the figure list.)
  • Figure 2: Top-10 words per topic on the BBC dataset with $K=5$. Each column is a distinct topic. Colors denote manually interpreted categories (politics, entertainment, business, sports, technology); lighter shades indicate weaker relevance and white indicates no clear association. Panels: BERTopic, ProdLDA, PRISM.
  • Figure 3: Top-10 words for biology topics in M10.
  • Figure 4: Top-10 words for a climate/agriculture topic in M10.
  • Figure 5: Illustration of the Word Intrusion Detection (WID) framework. A large language model is prompted to identify the word that does not belong in a list of top topic words (a.k.a. the intruder). The prompt shown here is illustrative; actual prompts used in our experiments follow a more structured format. (An illustrative prompt-construction sketch follows the figure list.)
  • ...and 4 more figures
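
The Figure 1 caption specifies the PRISM pipeline step by step, so it can be sketched end to end. The snippet below is a minimal illustration under simplifying assumptions (dense matrices, a fixed co-occurrence window, and a Gaussian mixture standing in for the soft clustering step); the names ppmi, diffusion_map, and estimate_beta and the window and eps parameters are ours for illustration, not the paper's. The reference implementation is in the linked repository.

import numpy as np
from sklearn.mixture import GaussianMixture

def ppmi(counts):
    # Positive PMI from a symmetric co-occurrence count matrix.
    # Assumes every vocabulary word co-occurs at least once.
    total = counts.sum()
    pw = counts.sum(axis=1) / total
    pmi = np.log(counts / total / np.outer(pw, pw) + 1e-12)
    return np.maximum(pmi, 0.0)

def diffusion_map(S, n_components, t=1):
    # Diffusion-map embedding of a nonnegative similarity matrix S.
    P = S / S.sum(axis=1, keepdims=True)                 # row-stochastic transitions
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)[1:n_components + 1]   # drop the trivial eigenpair
    return vecs[:, order].real * vals[order].real ** t

def estimate_beta(docs, vocab, K, window=5, eps=0.01):
    # docs: list of token lists; vocab: list of word types; K: number of topics.
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    counts = np.zeros((V, V))
    for doc in docs:
        ids = [idx[w] for w in doc if w in idx]
        for c, wi in enumerate(ids):
            for wj in ids[max(0, c - window):c]:         # pairs within a window
                counts[wi, wj] += 1
                counts[wj, wi] += 1
    M = ppmi(counts)
    Mn = M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-12)
    S = np.maximum(Mn @ Mn.T, 0.0)                       # second-order (cosine) similarity
    emb = diffusion_map(S, n_components=min(10, V - 1))
    resp = GaussianMixture(K, random_state=0).fit(emb).predict_proba(emb)
    return resp.T + eps                                  # (K, V) topic-word Dirichlet prior

The resulting $\hat{\boldsymbol{\beta}}$ can then be handed to any LDA implementation that accepts an asymmetric topic--word prior (e.g., as the eta argument of gensim's LdaModel, which takes a num_topics-by-num_terms array); the generative process itself is untouched, exactly as the caption states.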
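
The WID evaluation in Figure 5 amounts to assembling such a query programmatically: shuffle a topic's top words together with one out-of-topic intruder and ask the model to name the odd one out. The sketch below is a deliberately simple version; as the caption notes, the prompts actually used in the experiments are more structured, and wid_prompt is a hypothetical helper name.

import random

def wid_prompt(topic_words, intruder, seed=0):
    # Shuffle the top topic words together with one out-of-topic intruder
    # and ask the model to name the word that does not belong.
    candidates = list(topic_words) + [intruder]
    random.Random(seed).shuffle(candidates)
    return ("Below is a list of words; all but one relate to a single topic. "
            "Reply with only the word that does not belong.\n"
            + ", ".join(candidates))

print(wid_prompt(["goal", "match", "league", "striker"], "inflation"))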