GINopic: Topic Modeling with Graph Isomorphism Network

Suman Adhya, Debarshi Kumar Sanyal

TL;DR

GINopic incorporates word dependencies into topic modeling by constructing a word similarity graph for each document and processing it with a Graph Isomorphism Network (GIN) inside a variational autoencoder (VAE) framework. It fuses the resulting graph embeddings with TF-IDF features and trains with a Dirichlet-prior-informed VAE objective, which improves topic coherence and yields a more discriminative latent space across multiple datasets. The approach performs well on both intrinsic (coherence and diversity) and extrinsic (classification) evaluations, and the paper analyzes GIN against other GNNs (GAT, GraphSAGE, and GCN) as well as the effect of the graph threshold δ on quality and efficiency. By modeling word-level dependencies explicitly rather than relying on sliding windows or flat embeddings, GINopic offers a scalable, graph-based route to more coherent and diverse topics in large text corpora.
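The pipeline summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that edges are kept wherever the cosine similarity between two word vectors is at least δ, that the adjacency is binary, and that a sum readout produces the document-level graph embedding. The function names (`build_document_graph`, `gin_layer`, `readout`) and the two-layer ReLU MLP are hypothetical choices made for illustration; only the GIN update rule itself, h' = MLP((1 + ε) h + Σ neighbors), is the standard formulation.

```python
import numpy as np


def build_document_graph(word_vectors, delta):
    """Binary adjacency over a document's words (assumed construction).

    An edge links two distinct words when the cosine similarity of
    their vectors is at least `delta`.
    """
    norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
    unit = word_vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    similarity = unit @ unit.T
    adjacency = (similarity >= delta).astype(float)
    np.fill_diagonal(adjacency, 0.0)  # no self-loops; GIN adds the self term itself
    return adjacency


def gin_layer(h, adjacency, w1, w2, eps=0.0):
    """One GIN update: h' = MLP((1 + eps) * h + sum of neighbor features)."""
    aggregated = (1.0 + eps) * h + adjacency @ h
    hidden = np.maximum(aggregated @ w1, 0.0)  # ReLU
    return hidden @ w2


def readout(h):
    """Sum node features into one graph-level embedding."""
    return h.sum(axis=0)


# Toy document with three words in a 2-dimensional embedding space.
vectors = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
adjacency = build_document_graph(vectors, delta=0.9)
identity = np.eye(2)
embedding = readout(gin_layer(vectors, adjacency, identity, identity))
```

In this toy case only the first two words are similar enough to be connected, so the third word contributes only its own features. Raising δ removes edges and makes the graph sparser, which is the quality and efficiency trade-off the summary refers to.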

Abstract

Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent research efforts have incorporated pre-trained contextualized language models, such as BERT embeddings, into topic modeling. However, they often neglect the intrinsic informational value conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks to capture the correlation between words. By conducting intrinsic (quantitative as well as qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Graph construction methodology.
  • Figure 2: Proposed framework for GINopic model.
  • Figure 3: Topic coherence (NPMI and CV) scores for each topic count for top-5 topic models on five datasets.
  • Figure 4: Latent space visualization for GINopic model across all five datasets.
  • Figure 5: Box plot of topic coherence (NPMI and CV) scores incorporating GIN, GAT, GraphSAGE, and GCN in GINopic on five datasets.
  • ...and 1 more figures