CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text
Jun Hirako, Ryohei Sasano, Koichi Takeda
TL;DR
The paper tackles the prediction of future citation counts by leveraging a paper's full main text. It introduces CiMaTe, a BERT-based approach that encodes each section of a paper to capture structural information and uses Transformer-based pooling, with two variants: CiMaTe_b, which uses the beginning of each section, and CiMaTe_w, which uses the full content split into chunks. On computational linguistics (CL) and biology (Bio) datasets, CiMaTe outperforms baselines that use only titles and abstracts or less structured text, improving Spearman's ρ by 5.1 points on CL and 1.8 points on Bio while offering a favorable cost-accuracy trade-off. The work demonstrates the value of structured main-text representations for long-document bibliometric tasks and suggests incorporating figures, tables, author data, and citation graphs as directions for further improvement.
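To make the difference between the two variants concrete, the following is a minimal sketch of how their inputs might be prepared. It is an illustration under stated assumptions, not the authors' implementation: whitespace splitting stands in for BERT's subword tokenizer, and the function names, the `max_tokens` parameter, and the toy `paper` dictionary are all hypothetical. The BERT encoding and the pooling step are omitted.

```python
from typing import Dict, List


def beginning_of_sections(sections: Dict[str, str], max_tokens: int) -> List[List[str]]:
    """CiMaTe_b-style input (assumed): keep only the first max_tokens tokens of each section."""
    # Whitespace splitting is a stand-in for a real subword tokenizer.
    return [text.split()[:max_tokens] for text in sections.values()]


def whole_section_chunks(sections: Dict[str, str], max_tokens: int) -> List[List[List[str]]]:
    """CiMaTe_w-style input (assumed): split each section's full text into consecutive chunks."""
    chunked = []
    for text in sections.values():
        tokens = text.split()
        # Consecutive, non-overlapping chunks of at most max_tokens tokens.
        chunked.append([tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)])
    return chunked


# Hypothetical two-section paper, used only for illustration.
paper = {
    "Introduction": "citation counts help readers find interesting papers quickly",
    "Method": "each section is encoded separately and then pooled",
}
b_inputs = beginning_of_sections(paper, max_tokens=4)  # one truncated sequence per section
w_inputs = whole_section_chunks(paper, max_tokens=4)   # a list of chunks per section
```

In this sketch the beginning-of-sections variant yields one sequence per section regardless of section length, while the full-content variant yields more sequences as sections grow, which is one plausible source of the cost-accuracy trade-off mentioned above.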
Abstract
Predicting the future citation counts of papers is increasingly important for finding interesting papers among an ever-growing body of literature. Although a paper's main text is an important factor in citation count prediction, it is difficult to handle in machine learning models because it is typically very long; thus, previous studies have not fully explored how to leverage it. In this paper, we propose a BERT-based citation count prediction model, called CiMaTe, that leverages the main text by explicitly capturing a paper's sectional structure. Through experiments with papers from the computational linguistics and biology domains, we demonstrate CiMaTe's effectiveness: it outperforms previous methods in Spearman's rank correlation coefficient by 5.1 points in the computational linguistics domain and by 1.8 points in the biology domain.
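Since the evaluation metric is Spearman's rank correlation coefficient, a short sketch of how it can be computed may help. The code below uses the standard definition (the Pearson correlation of the ranks, with tied values sharing their average rank). The function names and the `predicted` and `actual` values are made up for illustration and are not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up predicted scores and citation counts for four papers.
predicted = [0.2, 1.5, 0.9, 3.1]
actual = [1, 10, 12, 40]
rho = spearman_rho(predicted, actual)
```

Because the metric depends only on rank order, a model is rewarded for ordering papers correctly by future citations rather than for predicting the exact counts.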
