Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models

Morgane Austern, Yuanchuan Guo, Zheng Tracy Ke, Tianle Liu

TL;DR

This work develops a principled framework for integrating contextualized word embeddings from pre-trained language models into topic modeling. Each document is modeled as a Poisson point process with intensity $\Omega_i(z)=\sum_{k=1}^K w_i(k)\mathcal{A}_k(z)$, where $\mathcal{A}_k$ is the $k$-th topic measure and $w_i(k)$ is the weight of topic $k$ in document $i$. The TRACE algorithm discretizes the embedding space into a net of hyperwords, applies a traditional topic model (TM) to the hyperword counts, and then kernel-smooths the result to recover nonparametric topic densities, while also estimating per-document topic weights. The authors establish convergence rates under Hölder smoothness, prove a minimax lower bound, and show that the method is rate-optimal for $\beta\le1$. Experiments on the Associated Press and MADStat datasets show improved topic coherence, interpretability, and downstream clustering compared with word-count TM and several embedding-based baselines. Because the traditional TM enters only as a plug-in step, existing methods can be used without altering their core procedures.
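To make the generative model concrete, the following is a minimal sketch of drawing one document as a Poisson point process with a mixture intensity. Several pieces are assumptions made only for illustration and do not come from the paper: the topic measures $\mathcal{A}_k$ are taken to be isotropic Gaussian densities in a two-dimensional embedding space, and the intensity is multiplied by a hypothetical `expected_length` so that the number of points is Poisson with that mean.

```python
import numpy as np

def sample_document(weights, topic_means, topic_sd, expected_length, rng):
    """Draw one document as a Poisson point process in embedding space.

    Assumed intensity: expected_length * sum_k weights[k] * A_k(z),
    where A_k is a Gaussian density (an illustrative choice, not the paper's).
    """
    # Total mass of the intensity is expected_length, so the point count is Poisson.
    n_points = rng.poisson(expected_length)
    # Given the count, points are i.i.d. from the normalized intensity,
    # i.e. from the mixture sum_k weights[k] * A_k.
    topics = rng.choice(len(weights), size=n_points, p=weights)
    noise = topic_sd * rng.standard_normal((n_points, topic_means.shape[1]))
    return topic_means[topics] + noise

rng = np.random.default_rng(0)
topic_means = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 3.0]])  # hypothetical topics
weights = np.array([0.7, 0.2, 0.1])                            # w_i(k) for one document
doc = sample_document(weights, topic_means, 0.5, 200, rng)
```

Here `doc` plays the role of the sequence of word embeddings for a single document; in practice those embeddings would come from a pre-trained language model rather than from a simulation.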

Abstract

Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling. We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of $K$ base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net-rounding applied before and kernel smoothing applied after. One advantage of this framework is that it treats the LLM as a black box, requiring no fine-tuning of its parameters. Another advantage is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modifications. Assuming each topic is a $\beta$-Hölder smooth intensity measure on the embedded space, we establish the rate of convergence of our method. We also provide a minimax lower bound and show that the rate of our method matches the lower bound when $\beta\leq 1$. Additionally, we apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches.
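The three stages described above (net-rounding, a plug-in traditional topic model, kernel smoothing) can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' implementation: the net is a regular grid, the plug-in topic model is a simple KL-divergence NMF with multiplicative updates standing in for whichever traditional method one prefers, the kernel is Gaussian, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def net_round(points, net):
    """Index of the nearest net point (hyperword) for each embedding."""
    d2 = ((points[:, None, :] - net[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def hyperword_counts(docs, net):
    """Build the (n_docs x n_hyperwords) count matrix after net-rounding."""
    counts = np.zeros((len(docs), len(net)))
    for i, pts in enumerate(docs):
        counts[i] = np.bincount(net_round(pts, net), minlength=len(net))
    return counts

def nmf_kl(X, K, n_iter=200, seed=0, eps=1e-12):
    """Stand-in plug-in topic model: X ~ W @ H via KL multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], K)) + eps
    H = rng.random((K, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    # Rescale so each topic (row of H) is a distribution over hyperwords
    # and each document (row of W) is a vector of topic weights.
    scale = H.sum(axis=1)
    H = H / scale[:, None]
    W = W * scale[None, :]
    W = W / np.maximum(W.sum(axis=1, keepdims=True), eps)
    return W, H

def smooth_topics(H, net, query, bandwidth):
    """Gaussian-kernel smoothing of hyperword masses into densities at query points."""
    d = net.shape[1]
    d2 = ((query[:, None, :] - net[None, :, :]) ** 2).sum(axis=2)
    kern = np.exp(-0.5 * d2 / bandwidth**2) / ((2 * np.pi) ** (d / 2) * bandwidth**d)
    return kern @ H.T  # shape (n_query, K)

# Simulated corpus: 30 documents, 2 Gaussian topics in a 2-D "embedding" space.
rng = np.random.default_rng(0)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
docs = []
for _ in range(30):
    w = rng.dirichlet([0.5, 0.5])
    n = rng.poisson(150)
    z = rng.choice(2, size=n, p=w)
    docs.append(means[z] + 0.6 * rng.standard_normal((n, 2)))

grid = np.linspace(-4, 4, 8)
net = np.array([[a, b] for a in grid for b in grid])    # 64 hyperwords
X = hyperword_counts(docs, net)                         # step 1: net-rounding
W_hat, H_hat = nmf_kl(X, K=2)                           # step 2: plug-in topic model
A_hat = smooth_topics(H_hat, net, means, bandwidth=0.8) # step 3: kernel smoothing
```

The plug-in step only ever sees the count matrix `X`, which is why any traditional topic model that accepts a document-by-word count matrix could replace `nmf_kl` here without modification. The grid net and the bandwidth value are arbitrary choices for this toy example.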

Paper Structure

This paper contains 42 sections, 18 theorems, 137 equations, 16 figures, 16 tables, 1 algorithm.

Key Result

Lemma 2.1

Let ${\cal T}$ be a transformer as in transformer, where each ${\cal T}^{\text{att}}_\ell$ consists of a Multi-Head Attention sub-layer and a Feed Forward sub-layer, with two sub-layers connected by a Residual Connection. Then, ${\cal F}={\cal T}^{\text{att}}_L\circ {\cal T}^{\text{att}}_{L-1}\circ\

Figures (16)

  • Figure 1: The proposed topic modeling approach.
  • Figure 2: The simulation results.
  • Figure 3: The top 20 anchor words for each estimated topic measure on the AP dataset.
  • Figure 4: Estimated topic measures on the AP dataset ($K=7$). For each topic, we compute $\widehat{\cal B}_k(z)=\widehat{\cal A}_k(z)/[\sum_{\ell=1}^K \widehat{\cal A}_\ell(z)]$ and plot its contour in a projected two-dimensional space (the projection is only made for visualization). Nine anchor regions are marked in the plots, each indicating a group of nearly-anchor words. Some representative words in each numbered region are given in the plots.
  • Figure 5: Heptagon plots showing the embeddings of three words, air (left), bank (middle), and bond (right), across topics of the AP dataset. Each vertex of the heptagon represents a topic, and red dots indicate embeddings associated with the respective word. Annotated examples highlight different contexts in which these words appear, demonstrating their semantic variability across topics.
  • ...and 11 more figures

Theorems & Definitions (23)

  • Lemma 2.1
  • Definition 1
  • Lemma 3.1
  • Lemma 4.1
  • Lemma 4.2: Bias
  • Corollary 4.1
  • Lemma 4.3: Variance
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • ...and 13 more