Which topics are best represented by science maps? An analysis of clustering effectiveness for citation and text similarity networks

Juan Pablo Bascur, Suzan Verberne, Nees Jan van Eck, Ludo Waltman

TL;DR

This study investigates which MeSH-defined topic categories are best represented in science maps built from biomedical publication data, comparing clustering effectiveness in citation versus text similarity networks. Using Leiden clustering on a PubMed-derived dataset, with MeSH term expansion and branch-based topic categories, it computes Purity, ICC, and log-ratio metrics to assess per-topic representation across multiple parameter settings. The main findings show that diseases, psychology, anatomy, organisms, and diagnostic/therapeutic techniques cluster most effectively, while natural science disciplines, geographical terms, information science, and health-care occupations cluster least; diseases and organisms notably exhibit higher clustering effectiveness in citation networks for smaller clusters. These results, contingent on parameters like Resolution and Coverage, offer practical guidance for constructing science maps and delimiting fields, and are supported by publicly available data and code for replication.

Abstract

A science map of topics is a visualization that shows topics identified algorithmically based on the bibliographic metadata of scientific publications. In practice, not all topics are well represented in a science map. We analyzed how effectively different topics are represented in science maps created by clustering biomedical publications. To achieve this, we investigated which topic categories, obtained from MeSH terms, are better represented in science maps based on citation or text similarity networks. To evaluate the clustering effectiveness of topics, we determined the extent to which documents belonging to the same topic are grouped together in the same cluster. We found that the best and worst represented topic categories are the same for citation and text similarity networks. The best represented topic categories are diseases, psychology, anatomy, organisms, and the techniques and equipment used for diagnostics and therapy, while the worst represented topic categories are natural science fields, geographical entities, information sciences, and health care and occupations. Furthermore, for the diseases and organisms topic categories and for science maps with smaller clusters, we found that topics tend to be better represented in citation similarity networks than in text similarity networks.
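The idea of measuring how far documents of one topic end up in the same cluster can be illustrated with a minimal Python sketch. It is not the paper's definition of Purity or ICC, which this page does not give. The function name `topic_cluster_stats`, the two quantities it returns, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def topic_cluster_stats(cluster_of, topic_docs):
    """Illustrative (hypothetical) per-topic measures for one clustering.

    cluster_of: dict mapping document id -> cluster id
    topic_docs: iterable of document ids that carry the topic

    Returns (best_cluster, concentration, purity) where
      best_cluster  = cluster holding the most topic documents
      concentration = share of the topic's documents in best_cluster
      purity        = share of best_cluster's documents that carry the topic
    """
    # Ignore topic documents that were never assigned to a cluster.
    docs = [d for d in topic_docs if d in cluster_of]
    if not docs:
        raise ValueError("no topic documents have a cluster assignment")

    cluster_sizes = Counter(cluster_of.values())        # documents per cluster
    topic_counts = Counter(cluster_of[d] for d in docs)  # topic docs per cluster

    best_cluster, hits = topic_counts.most_common(1)[0]
    concentration = hits / len(docs)
    purity = hits / cluster_sizes[best_cluster]
    return best_cluster, concentration, purity


# Toy data (assumed): six documents in three clusters, four of them on the topic.
cluster_of = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1, "d6": 2}
topic = {"d1", "d2", "d3", "d4"}
result = topic_cluster_stats(cluster_of, topic)
```

In this toy case, three of the four topic documents sit in cluster 0, and cluster 0 contains only topic documents. The same function could be run once on a citation-based clustering and once on a text-based clustering to compare how a topic fares in each.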

Paper Structure (31 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Box plots showing the distribution of C-Purity, C-ICC, T-Purity and T-ICC for each branch. The median value of each box plot is reported along the right Y axis. The branches are sorted as in Table \ref{table:ranking_purity}.
  • Figure 2: Box plots showing the distribution of rPurity and rICC for each value of Size bin, Resolution and Coverage.
  • Figure 3: Box plots showing the distribution of rPurity and rICC for each branch.