Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

Imed Keraghel, Mohamed Nadif

TL;DR

This work tackles document clustering by integrating Named Entity Recognition (NER) and Large Language Model (LLM) embeddings within a graph-based framework to capture deep semantic relationships beyond co-occurrence. It builds an entity-context graph using a four-step NE similarity pipeline and jointly optimizes embeddings and clustering with a Graph Convolutional Network (GCN) objective. Experiments on English and French datasets show that GCC* with an entity-based adjacency and LLM embeddings outperforms co-occurrence- and KNN-based baselines, particularly for entity-rich documents, and embedding visualizations confirm improved separability. The results underscore the practical potential of combining NE signals with contextual embeddings for more effective document clustering in real-world corpora.

Abstract

Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
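The graph construction described in the abstract can be illustrated with a small sketch. It is not the authors' four-step NE similarity pipeline or their joint embedding-and-clustering objective. It assumes plain Jaccard overlap between per-document entity sets as the edge weight and applies one standard GCN-style propagation step with symmetric normalization. The toy documents, the 2-d vectors standing in for LLM embeddings, and all function names are hypothetical.

```python
import math

def jaccard(a, b):
    """Jaccard overlap between two entity sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def entity_adjacency(doc_entities):
    """Weighted adjacency: W[i][j] is the entity-set similarity of documents i and j."""
    n = len(doc_entities)
    return [[jaccard(doc_entities[i], doc_entities[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def normalize_with_self_loops(W):
    """Symmetric normalization D^{-1/2} (W + I) D^{-1/2}, as in a standard GCN layer."""
    n = len(W)
    A = [[W[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def propagate(S, X):
    """One linear propagation step S @ X: mixes each document's features with its neighbours'."""
    return [[sum(S[i][k] * X[k][d] for k in range(len(X))) for d in range(len(X[0]))]
            for i in range(len(S))]

# Toy data (assumed for illustration): entity sets per document and
# 2-d stand-ins for LLM embeddings.
docs = [
    {"Paris", "Macron"},
    {"Paris", "Macron", "EU"},
    {"NASA", "Mars"},
    {"NASA", "Mars", "SpaceX"},
]
X = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]

W = entity_adjacency(docs)
S = normalize_with_self_loops(W)
Y = propagate(S, X)
```

In this toy case, documents 0 and 1 share two of three entities, so their propagated features in `Y` move closer together, while documents with no shared entities stay unconnected. A clustering step such as k-means would then run on the propagated features rather than on the raw embeddings.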

Paper Structure

This paper contains 26 sections, 12 equations, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Comparison of graph structures for the BBC News dataset. (Left) KNN-based graph with lexical similarity; (Right) NER-based graph capturing entity similarities.
  • Figure 2: Overview of the proposed model pipeline: LLM-based feature extraction, NER-based graph construction, and joint embedding and clustering.
  • Figure 3: Comparison of dendrograms for different datasets.
  • Figure 4: UMAP projection of the cluster embeddings obtained with GPT ($\mathbf{X}_{\ell \ell m}$) compared to those obtained on $\mathbf{Y}^p\mathbf{W}$ derived from GCC$^*$.