Tethering Broken Themes: Aligning Neural Topic Models with Labels and Authors

Mayank Nagda, Phil Ostheimer, Sophie Fellenz

TL;DR

We address misalignment in neural topic models by introducing FANToM, which aligns latent topics with document labels and authors via an expert-aligned Dirichlet prior $p_{\boldsymbol{\rho}}(z)$ and a separate topic-author decoder. The approach yields more interpretable topics and meaningful author distributions, improving topic quality and alignment over strong baselines across multiple datasets. The framework supports semi-supervised settings and enables learning a shared embedding space among topics, words, and authors, with LLMs used as experts for labeling. This has practical implications for topic modeling in domains with rich metadata and author signals.

Abstract

Topic models are a popular approach for extracting semantic information from large document collections. However, recent studies suggest that the topics generated by these models often do not align well with human intentions. Although metadata such as labels and authorship information is often available, it has not yet been effectively incorporated into neural topic models. To address this gap, we introduce FANToM, a novel method to align neural topic models with both labels and authorship information. FANToM allows for the inclusion of this metadata when available, producing interpretable topics and author distributions for each topic. Our approach demonstrates greater expressiveness than conventional topic models by learning the alignment between labels, topics, and authors. Experimental results show that FANToM improves existing models in terms of both topic quality and alignment. Additionally, it identifies author interests and similarities.

Paper Structure

This paper contains 44 sections, 6 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: FANToM in action: A comparison of semantically closest topics learned by DVAE (left) and DVAE trained with FANToM (right) for alignment. Notably, FANToM not only accurately aligns the learned topic with the label (astrophysics) and authors but also improves the quality of the learned topic.
  • Figure 2: t-SNE projection of topic embeddings from the DVAE model (triangles) and its FANToM variant (squares), alongside document embeddings from the 20NG dataset, color-coded by labels. Ideally, topic embeddings should be positioned near the centroid of their corresponding document clusters. The circled regions highlight discrepancies where DVAE either overrepresents or underrepresents certain topics, while FANToM achieves a more balanced and accurate alignment with document labels, reinforcing its effectiveness in topic representation.
  • Figure 3: Illustration of FANToM: The framework aligns labels and authorship information with topics. It incorporates expert-assigned labels to establish a prior distribution parameterized by $\gamma$, which is then aligned with the posterior. For authorship, a separate decoder is used to learn the multinomial distribution over authors, ensuring a structured representation of author-topic relationships. Overall, FANToM ensures a structured and interpretable alignment between topics, labels, and authors.
  • Figure 4: Comparison of topic alignment between FANToM(L) and DVAE (baseline) on the 20NG dataset. The semantically closest topics are linked (right to left). FANToM(L) cleanly separates topics based on labels, while DVAE lacks this distinction. FANToM(L) generates esoteric topics closely aligned with labels and learns multiple topics within the graphics label.
  • Figure 5: Comparison between the topic words estimated using human labels (bottom) and LLM labels (top) in the AG News corpus, using FANToM(L) for topic estimation. Both align well with the corresponding labels.
  • ...and 6 more figures
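
The two components named in the Figure 3 caption, a label-informed Dirichlet prior and a separate decoder over authors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the label-to-topic mapping, the `high`/`low` concentration values, and the symmetric fallback for unlabeled documents are all assumptions made for illustration. The KL term uses the standard closed form for two Dirichlet distributions, and the author decoder is shown as a plain softmax mixture.

```python
import math

def digamma(x: float) -> float:
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def label_prior(label, label_to_topics, num_topics, high=1.0, low=0.01):
    """Dirichlet concentrations that favor the topics tied to `label`.

    Assumption: an unknown or missing label falls back to a symmetric prior,
    which is one possible way to handle unlabeled documents.
    """
    on = set(label_to_topics.get(label, range(num_topics)))
    return [high if k in on else low for k in range(num_topics)]

def kl_dirichlet(alpha, beta):
    """Closed-form KL(Dir(alpha) || Dir(beta))."""
    a0, b0 = sum(alpha), sum(beta)
    kl = math.lgamma(a0) - math.lgamma(b0)
    for a, b in zip(alpha, beta):
        kl += math.lgamma(b) - math.lgamma(a) + (a - b) * (digamma(a) - digamma(a0))
    return kl

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def author_distribution(theta, topic_author_logits):
    """Mix per-topic author distributions by the document's topic proportions."""
    per_topic = [softmax(row) for row in topic_author_logits]
    num_authors = len(per_topic[0])
    return [sum(t * row[a] for t, row in zip(theta, per_topic))
            for a in range(num_authors)]

# Hypothetical example: 4 topics, 2 labels, 3 authors.
label_to_topics = {"astrophysics": [0, 1], "graphics": [2]}
prior = label_prior("astrophysics", label_to_topics, num_topics=4)
posterior = [0.9, 0.8, 0.05, 0.05]          # made-up encoder output
penalty = kl_dirichlet(posterior, prior)    # alignment term added to the loss
theta = [0.5, 0.4, 0.05, 0.05]              # made-up topic proportions
authors = author_distribution(theta, [[2.0, 0.0, 0.0],
                                      [0.0, 1.0, 0.0],
                                      [0.0, 0.0, 0.0],
                                      [0.0, 0.0, 3.0]])
```

In this sketch, `penalty` shrinks as the posterior concentrations move toward the topics assigned to the document's label, which is the intuition behind aligning the posterior with a label-derived prior.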