DiffETM: Diffusion Process Enhanced Embedded Topic Model

Wei Shao, Mingyang Liu, Linqi Song

TL;DR

This work addresses the limited expressivity of ETM, which stems from its logistic-normal assumption for document-topic distributions, by introducing a diffusion process that enriches the representation while keeping optimization easy. The proposed DiffETM comprises three modules (diffusion-based enhancement, document-topic distribution, and topic-word distribution) and is trained with a reconstruction loss plus a KL regularization term, giving the objective $L = L(\boldsymbol{X}, \boldsymbol{X'}) + \lambda \cdot KL(\boldsymbol{z} \| \mathcal{N}(0,1))$. The forward diffusion uses a linear noise schedule and yields representations closer to a normal distribution. The diffusion-guided latent $\boldsymbol{z}$ is passed through a softmax to produce the document-topic distribution $\boldsymbol{\theta}$, while $\boldsymbol{\beta}$ provides the topic-word distributions used for reconstruction. Experiments on 20NewsGroup and NYT show consistently improved topic coherence, diversity, quality, and perplexity over strong baselines, with up to 77.89% relative gain in topic quality, supporting diffusion-augmented topic modeling as a practical approach.
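To make the forward diffusion step more concrete, the following is a minimal NumPy sketch of a linear noise schedule and the standard closed-form forward sample. It is an illustration under stated assumptions, not the authors' implementation: the summary only says the schedule is linear, so the number of steps, the schedule endpoints, the function names, and the toy input `x0` are all hypothetical choices borrowed from common diffusion-model practice.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances. The endpoints are assumed, not from the paper."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
num_steps = 1000                          # assumed number of diffusion steps
betas = linear_beta_schedule(num_steps)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

# Hypothetical, clearly non-normal document representations (256 docs, 50 dims).
x0 = rng.uniform(0.0, 5.0, size=(256, 50))
x_T = forward_diffuse(x0, num_steps - 1, alpha_bar, rng)

print("x0  mean/std:", x0.mean(), x0.std())
print("x_T mean/std:", x_T.mean(), x_T.std())
```

Because `alpha_bar` shrinks toward zero as the step index grows, the diffused sample is dominated by Gaussian noise at late steps, which matches the claim that the forward process pushes representations closer to a normal distribution.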

Abstract

The embedded topic model (ETM) is a widely used approach that assumes the sampled document-topic distribution conforms to the logistic normal distribution for easier optimization. However, this assumption oversimplifies the real document-topic distribution, limiting the model's performance. In response, we propose a novel method that introduces the diffusion process into the sampling of the document-topic distribution to overcome this limitation while keeping optimization easy. We validate our method through extensive experiments on two mainstream datasets, demonstrating its effectiveness in improving topic modeling performance.

Paper Structure

This paper contains 16 sections, 13 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: KL loss of several classical embedded topic models on the test set of the 20NewsGroup dataset with 50 topics. Each point marks a checkpoint that is better than all previous ones. The loss is the KL divergence between the sampled topic distribution variables and the normal distribution. The loss keeps growing during training, which indicates that the sampled document-topic distributions tend to break away from the logistic-normal constraint in order to achieve better topic modeling performance.
  • Figure 2: Architecture of the proposed model, comprising the diffusion module (yellow), the document-topic distribution computation module (green), and the topic-word distribution module (blue).
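
As a rough illustration of how the document-topic module, the topic-word module, and the objective $L = L(\boldsymbol{X}, \boldsymbol{X'}) + \lambda \cdot KL(\boldsymbol{z} \| \mathcal{N}(0,1))$ fit together, here is a small NumPy sketch. Several parts are assumptions rather than details confirmed by this summary: the latent $\boldsymbol{z}$ is drawn from a diagonal Gaussian with hypothetical parameters `mu` and `logvar` (standing in for whatever the encoder and diffusion module produce), the topic-word matrix follows the usual ETM form of a softmax over inner products of topic and word embeddings, the reconstruction loss is a bag-of-words negative log-likelihood, and the KL term uses the closed form for a diagonal Gaussian against a standard normal. The function name `diffetm_style_loss` is also hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def diffetm_style_loss(bow, mu, logvar, rho, alpha, lam, rng):
    """Reconstruction loss plus lam * KL(z || N(0, I)) for a batch of documents.

    bow:    (D, V) bag-of-words counts
    mu, logvar: (D, K) assumed Gaussian parameters of the latent z
    rho:    (V, E) word embeddings, alpha: (K, E) topic embeddings (ETM-style)
    """
    # Reparameterized sample of the latent z.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    theta = softmax(z, axis=1)               # (D, K) document-topic distribution
    beta = softmax(alpha @ rho.T, axis=1)    # (K, V) topic-word distribution
    word_probs = theta @ beta                # (D, V) reconstructed word distribution
    # Negative log-likelihood of the observed counts, averaged over documents.
    recon = -(bow * np.log(word_probs + 1e-10)).sum(axis=1).mean()
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * (1.0 + logvar - mu**2 - np.exp(logvar)).sum(axis=1).mean()
    return recon + lam * kl, recon, kl

rng = np.random.default_rng(0)
D, K, V, E = 4, 3, 10, 5                     # toy sizes, chosen arbitrarily
bow = rng.integers(0, 3, size=(D, V)).astype(float)
mu, logvar = np.zeros((D, K)), np.zeros((D, K))
rho = rng.standard_normal((V, E))
alpha = rng.standard_normal((K, E))

loss, recon, kl = diffetm_style_loss(bow, mu, logvar, rho, alpha, lam=0.5, rng=rng)
print("loss:", loss, "recon:", recon, "kl:", kl)
```

With `mu` and `logvar` set to zero the latent already matches the standard normal, so the KL term vanishes and the loss reduces to the reconstruction term. A KL value that grows during training, as in Figure 1, corresponds to `mu` and `logvar` drifting away from that prior.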