Introducing Three New Benchmark Datasets for Hierarchical Text Classification
Jaco du Toit, Herman Redelinghuys, Marcel Dunaiski
TL;DR
This paper addresses the lack of detailed HTC benchmarks in the scientific publication domain by introducing three Web of Science-based datasets: WOS$_\text{JT}$ (journal-based), WOS$_\text{CT}$ (citation-based), and WOS$_\text{JTF}$ (a filtered JT–CT hybrid). It combines journal- and citation-based classifications to improve labeling reliability and supports multi-label assignments, especially for multidisciplinary papers. The authors validate dataset quality through clustering with semantic embeddings, showing higher intra-class similarity and better separation for the filtered JT–CT dataset, and benchmark four state-of-the-art HTC methods, finding that GHLA RoBERTa and HPTD-DeBERTaV3 perform best, with notable gains on the WOS$_\text{JTF}$ dataset. These datasets offer balanced second-level class distributions and provide robust baselines for future machine learning-based scientific publication classification.
Abstract
Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: Web of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2), and New York Times (NYT). However, apart from the well-documented RCV1-V2 dataset, these datasets are not accompanied by detailed descriptions of how they were constructed. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications, comprising the titles and abstracts of papers from the Web of Science publication database. We first create two baseline datasets that use existing journal- and citation-based classification schemas. Because each of these two schemas has its own shortcomings, we propose an approach that combines their classifications to improve the reliability and robustness of the dataset. We evaluate the three datasets with a clustering-based analysis and show that our proposed approach yields a higher-quality dataset in which documents belonging to the same class are semantically more similar than in the other two datasets. Finally, we report the classification performance of four state-of-the-art HTC approaches on these three new datasets to provide baselines for future studies on machine learning-based techniques for scientific publication classification.
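To make the notion of "documents in the same class being semantically more similar" concrete, the following is a minimal sketch of one possible measure: the mean pairwise cosine similarity among document embeddings that share a class label. It is an illustration only and is not claimed to be the clustering analysis used in the paper. The function names `cosine` and `mean_intra_class_similarity`, the toy vectors, and the labels are all hypothetical; in practice the vectors would come from some sentence-embedding model applied to titles and abstracts.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_intra_class_similarity(embeddings, labels):
    """Return {class label: mean pairwise cosine similarity within that class}.

    Hypothetical helper for illustration. Classes with fewer than two
    documents are skipped because they have no pairs to compare.
    """
    # Group the embedding vectors by their class label.
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)

    scores = {}
    for label, vecs in by_class.items():
        # Every unordered pair of distinct documents in the class.
        pairs = list(combinations(vecs, 2))
        if pairs:
            scores[label] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return scores


# Toy data (assumed): four 2-dimensional "embeddings" in two classes.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_labels = ["physics", "physics", "biology", "biology"]
toy_scores = mean_intra_class_similarity(toy_embeddings, toy_labels)
```

Under this measure, a dataset whose labels are more reliable would be expected to show higher per-class scores, which is the kind of comparison the abstract describes across the three datasets.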
