Topics as Entity Clusters: Entity-based Topics from Large Language Models and Graph Neural Networks

Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya

TL;DR

This paper addresses the sparsity and interpretability challenges of word-based topic models by adopting entity-based topics. It introduces Topics as Entity Clusters (TEC), which represents entities through a fused embedding that combines implicit knowledge from large language models with explicit knowledge from a knowledge graph, and clusters these embeddings to form topics. Across Wikipedia, CC-News, and MLSUM, TEC consistently outperforms baselines, with graph-based (explicit) embeddings delivering the strongest gains in topic coherence and quality, while maintaining language-agnostic topic representations. The work demonstrates the practical value of integrating structured knowledge into neural topic modeling and points to future improvements via deeper graph models and human-centered evaluations.

Abstract

Topic models aim to reveal latent structures within a corpus of text, typically through the use of term-frequency statistics over bag-of-words representations from documents. In recent years, conceptual entities -- interpretable, language-independent features linked to external knowledge resources -- have been used in place of word-level tokens, as words typically require extensive language processing with a minimal assurance of interpretability. However, current literature is limited when it comes to exploring purely entity-driven neural topic modeling. For instance, despite the advantages of using entities for eliciting thematic structure, it is unclear whether current techniques are compatible with these sparsely organised, information-dense conceptual units. In this work, we explore entity-based neural topic modeling and propose a novel topic clustering approach using bimodal vector representations of entities. Concretely, we extract these latent representations from large language models and graph neural networks trained on a knowledge base of symbolic relations, in order to derive the most salient aspects of these conceptual units. Analysis of coherency metrics confirms that our approach is better suited to working with entities in comparison to state-of-the-art models, particularly when using graph-based embeddings trained on a knowledge base.

Paper Structure

This paper contains 35 sections, 5 equations, 1 figure, 7 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of Topics as Entity Clusters (TEC). The top half illustrates how entity embeddings are processed into topic centroids and the top entities per topic, while the bottom half illustrates how the top topics are inferred for each document.
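
The pipeline summarized in the TL;DR and in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation, and the details are assumptions made for illustration: random vectors stand in for the LLM-derived (implicit) and knowledge-graph (explicit) entity embeddings, fusion is done by concatenating the two normalized views, clustering is plain k-means, and both ranking steps use cosine similarity. The helper names (`fuse`, `kmeans`, `top_entities`, `document_topics`) are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fuse(implicit, explicit):
    """Assumed fusion: concatenate the two normalized views, then renormalize."""
    return l2_normalize(np.hstack([l2_normalize(implicit), l2_normalize(explicit)]))


def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's k-means; an empty cluster keeps its previous centroid."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids


def top_entities(centroids, entities, n):
    """For each topic centroid, indices of the n most cosine-similar entities."""
    sims = l2_normalize(centroids) @ entities.T
    return np.argsort(-sims, axis=1)[:, :n]


def document_topics(entity_ids, centroids, entities, n):
    """Rank topics for a document represented by the entities it mentions."""
    doc_vec = entities[entity_ids].mean(axis=0)
    sims = l2_normalize(centroids) @ doc_vec
    return np.argsort(-sims)[:n]


# Toy data: random stand-ins for real entity embeddings (an assumption).
rng = np.random.default_rng(0)
n_entities, k = 10, 2
implicit = rng.normal(size=(n_entities, 8))  # placeholder for LLM-derived vectors
explicit = rng.normal(size=(n_entities, 4))  # placeholder for knowledge-graph vectors

entities = fuse(implicit, explicit)
centroids = kmeans(entities, k)                   # topic centroids
tops = top_entities(centroids, entities, n=3)     # top entities per topic
doc_topics = document_topics([0, 3, 7], centroids, entities, n=1)
```

Because a document is represented only by the identifiers of the entities it mentions, the same centroids can in principle be reused across languages, which matches the language-agnostic property noted in the TL;DR.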