Clustering scientific publications: lessons learned through experiments with a real citation network

Vu Thi Huong, Thorsten Koch

TL;DR

Clustering scientific publications aims to reveal the underlying research structure in bibliographic data, but applying standard graph clustering to large real-world citation networks tests the scalability and robustness of the methods. The authors compare spectral, Louvain, and Leiden methods on a large Web of Science graph for Mathematics and OR&MS, finding that spectral clustering does not scale, while Louvain and Leiden are fast but require careful parameter tuning. A tuned Leiden solution yields two dominant clusters that largely align with the known fields and exhibit high intra-cluster connectivity and purity, illustrating both the practical gains and the remaining limitations. The work highlights the importance of data-aware tuning and points to future directions in soft clustering and taxonomy reevaluation to better model interdisciplinarity in bibliometrics.

Abstract

Clustering scientific publications can reveal underlying research structures within bibliographic databases. Graph-based clustering methods, such as the spectral, Louvain, and Leiden algorithms, are frequently used because they model citation networks effectively. However, their performance may degrade when applied to real-world data. This study evaluates these clustering algorithms on a citation graph comprising approximately 700,000 papers and 4.6 million citations extracted from Web of Science. The results show that while scalable methods like Louvain and Leiden run efficiently, their default settings often yield poor partitions. Meaningful outcomes require careful parameter tuning, especially for large networks with uneven structures, including a dense core and loosely connected papers. These findings offer practical lessons on handling large-scale data and on selecting and tuning methods to suit the specific structure of a bibliometric clustering task.
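To make the point about parameter tuning concrete, the sketch below shows how a resolution parameter changes which partition a modularity-based method would prefer. It rests on assumptions: the excerpt does not say which parameter the authors tuned, so the resolution parameter of generalized modularity is used here as a plausible stand-in, and the function `modularity`, the toy graph, and the two candidate partitions are all hypothetical illustrations rather than the paper's setup.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Generalized modularity of an undirected, unweighted graph.

    Q = sum over clusters c of  L_c / m - resolution * (d_c / (2m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in the cluster
    degree = defaultdict(int)    # summed node degrees per cluster
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(
        internal[c] / m - resolution * (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Two candidate partitions: one triangle per cluster, or everything together.
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
merged = {node: 0 for node in range(6)}

for gamma in (0.2, 1.0):
    q_split = modularity(edges, split, gamma)
    q_merged = modularity(edges, merged, gamma)
    best = "split" if q_split > q_merged else "merged"
    print(f"resolution={gamma}: split={q_split:.3f}, merged={q_merged:.3f}, preferred={best}")
```

On this toy graph the merged partition scores higher at a low resolution and the split partition scores higher at the conventional value of 1.0, so the same data can yield different clusterings depending on the setting. A real citation graph with a dense core and many loosely connected papers is far less clear-cut, which is one reason defaults may not transfer.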

Paper Structure

This paper contains 4 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Solution by the Leiden algorithm: paper distribution among clusters by size
  • Figure 2: Solution by the Leiden algorithm: heatmap of link distribution between clusters (left) and zoom-in on the 5 biggest clusters (right)
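
A link-distribution heatmap such as the one in Figure 2 is presumably built from counts of citation links within and between clusters. The sketch below shows one way to tabulate such counts and the share of links that stay inside a cluster. It is an illustration under assumptions, not the authors' procedure: links are treated as undirected, and the names `link_counts` and `intra_cluster_share` and the toy data are hypothetical.

```python
from collections import Counter


def link_counts(edges, membership):
    """Count links per unordered pair of cluster labels.

    Keys are (a, b) with a <= b; a == b means the link is intra-cluster.
    """
    counts = Counter()
    for u, v in edges:
        a, b = sorted((membership[u], membership[v]))
        counts[(a, b)] += 1
    return counts


def intra_cluster_share(counts):
    """Fraction of all links whose endpoints lie in the same cluster."""
    total = sum(counts.values())
    intra = sum(n for (a, b), n in counts.items() if a == b)
    return intra / total


# Toy data (hypothetical): two triangles joined by a single link.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

toy_counts = link_counts(toy_edges, toy_membership)
for (a, b), n in sorted(toy_counts.items()):
    print(f"clusters {a}-{b}: {n} links")
print(f"intra-cluster share: {intra_cluster_share(toy_counts):.3f}")
```

Arranging these counts as a symmetric matrix indexed by cluster gives the kind of table a heatmap can display; a strong diagonal corresponds to high intra-cluster connectivity.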