
Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning

Susie Xi Rao, Peter H. Egger, Ce Zhang

TL;DR

This work addresses the need for a global, hierarchical taxonomy to index scholarly publications by discipline, field, and subfield and to analyze interdisciplinarity at scale. It proposes a modular, three-level neural classification system that can operate in single-label or multi-label modes, trained on MAG abstracts and aligned with Wikipedia and domain-specific taxonomies. The authors demonstrate strong performance (often >90% accuracy) across 44 disciplines, 718 fields, and 1,485 subfields, with CNN/RNN as robust baselines and Transformers aiding in hard cases, and they introduce an LMDB-based data pipeline for scalable preprocessing. Beyond classification, the paper contributes methodologies and metrics for interfield and interdisciplinarity analysis, enabling automated indexing and insight into cross-disciplinary knowledge flows with practical implications for research evaluation and policy.

Abstract

This paper presents a hierarchical classification system that uses a scholarly publication's abstract to automatically assign it to a three-tier hierarchical label set (discipline, field, subfield) in a multi-class setting. The system enables a holistic categorization of research activities within this hierarchy, in terms of knowledge production through articles and impact through citations, while permitting those activities to fall into multiple categories. The classification system distinguishes 44 disciplines, 718 fields, and 1,485 subfields among 160 million abstract snippets in Microsoft Academic Graph (version 2018-05-17). We used batch training in a modularized and distributed fashion to allow for interdisciplinary and interfield classifications in single-label and multi-label settings. In total, we conducted 3,140 experiments across all considered models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers). The classification accuracy exceeds 90% in 77.13% of the single-label classifications and in 78.19% of the multi-label classifications. We examine the advantages of our classification in terms of its ability to better align research texts and output with disciplines, to classify them adequately in an automated way, and to capture the degree of interdisciplinarity. The proposed system (a set of pre-trained models) can serve as the backbone of an interactive system for indexing scientific publications in the future.
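
To make the three-tier, modular design more concrete, the following is a minimal sketch of how an abstract might be routed from a discipline classifier to a per-discipline field classifier and then to a per-field subfield classifier, in either single-label or multi-label mode. It is not the authors' implementation: the keyword-based scorer is a toy stand-in for the trained CNN/RNN/Transformer models, and the label names, keyword lists, function names (`keyword_scorer`, `pick`, `classify`), and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A "model" here is any callable mapping a text to per-label scores.
Scorer = Callable[[str], Dict[str, float]]
Path = Tuple[str, Optional[str], Optional[str]]


def keyword_scorer(keywords: Dict[str, List[str]]) -> Scorer:
    """Toy stand-in for a trained model (assumption, not from the paper):
    a label's score is the share of its keywords found in the text."""
    def score(text: str) -> Dict[str, float]:
        tokens = set(text.lower().split())
        return {label: sum(w in tokens for w in words) / len(words)
                for label, words in keywords.items()}
    return score


def pick(scores: Dict[str, float], multi_label: bool, threshold: float) -> List[str]:
    """Single-label: keep the top-scoring label. Multi-label: keep every
    label whose score reaches the (hypothetical) threshold."""
    if not scores:
        return []
    if multi_label:
        return sorted(label for label, s in scores.items() if s >= threshold)
    return [max(scores, key=scores.get)]


def classify(text: str,
             discipline_model: Scorer,
             field_models: Dict[str, Scorer],
             subfield_models: Dict[Tuple[str, str], Scorer],
             multi_label: bool = False,
             threshold: float = 0.5) -> List[Path]:
    """Route a text down the hierarchy; each level only consults the
    module that belongs to the label chosen one level above."""
    paths: List[Path] = []
    for d in pick(discipline_model(text), multi_label, threshold):
        field_model = field_models.get(d)
        if field_model is None:          # no field module for this discipline
            paths.append((d, None, None))
            continue
        for f in pick(field_model(text), multi_label, threshold):
            sub_model = subfield_models.get((d, f))
            if sub_model is None:        # no subfield module for this field
                paths.append((d, f, None))
                continue
            for s in pick(sub_model(text), multi_label, threshold):
                paths.append((d, f, s))
    return paths


# Hypothetical label set and keywords, for illustration only.
discipline_model = keyword_scorer({
    "economics": ["trade", "tariff"],
    "computer science": ["neural", "network", "algorithm"],
})
field_models = {
    "economics": keyword_scorer({
        "international economics": ["trade", "tariff"],
        "labor economics": ["wage", "employment"],
    }),
    "computer science": keyword_scorer({
        "machine learning": ["neural", "network"],
        "databases": ["query", "index"],
    }),
}
subfield_models = {
    ("economics", "international economics"): keyword_scorer({
        "trade policy": ["tariff", "quota"],
        "exchange rates": ["currency", "exchange"],
    }),
}

abstract = "we use a neural network to predict trade flows under a tariff change"
single = classify(abstract, discipline_model, field_models, subfield_models)
multi = classify(abstract, discipline_model, field_models, subfield_models,
                 multi_label=True)
print(single)
print(multi)
```

In this sketch, single-label mode returns exactly one path through the hierarchy, while multi-label mode can return several paths, which is one simple way to represent an abstract that spans disciplines. Keeping one module per parent label means each classifier can be trained and replaced independently, which is the property the modular design is meant to illustrate.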

Paper Structure

This paper contains 50 sections, 2 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: Discipline hierarchy of the Wikipedia taxonomy. Note that we only have to classify the leaf nodes, which leaves us with 44 disciplines (marked with (*)).
  • Figure 2: Discipline (JEL) publication mapping using FOS tags from MAG.
  • Figure 3: Three-level hierarchical classification system.
  • Figure 4: Distributions of total papers, number of fields and their ratio in 44 disciplines.
  • Figure 5: Box plots for demand and supply of 44 disciplines.
  • ...and 11 more figures