
Predicting Award Winning Research Papers at Publication Time

Riccardo Vella, Andrea Vitaletti, Fabrizio Silvestri

TL;DR

This work predicts whether a paper will win an award using only information available at publication. It models each paper with a local citation subgraph $G_i(\Delta)$ and derives both topological features and TinyBERT-based textual embeddings, which are then fused through a two-stage MLP ensemble. Combining the two feature types yields an F1 score of $0.694$, driven by high recall and robust identification of non-winning papers. The work also provides interpretability through analyses of neighborhood topology and textual similarity. This time-independent prediction offers researchers an early signal of potential impact and highlights how a paper's network position and surrounding text relate to future recognition.

Abstract

In recent years, many studies have focused on predicting the scientific impact of research papers. Most of these predictions are based on citation counts or rely on features obtainable only from already published papers. In this study, we predict the likelihood that a research paper wins an award, relying only on information available at publication time. For each paper, we build the citation subgraph induced from its bibliography. We initially consider some features of this subgraph, such as the density and the global clustering coefficient, to make our prediction. Then, we mix this information with textual features, extracted from the abstract and the title, to obtain a more accurate final prediction. We ran our experiments on the ArnetMiner citation graph, while the ground truth on award-winning papers was obtained from a collection of best paper awards from 32 computer science conferences. In our experiments, we obtained an encouraging F1 score of 0.694. Remarkably, the high recall and the low false-negative rate show that the model performs very well at identifying papers that will not win an award. This behavior can help researchers get a first evaluation of their work at publication time. Lastly, we ran some first experiments on interpretability. Our results highlight some interesting patterns in both topological and textual features.
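As an illustration of the two topological features named in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the citation graph is given as a list of directed `(citing, cited)` pairs, that the subgraph contains only the papers in the bibliography, that density is the directed one, and that the global clustering coefficient is computed on the undirected version of the subgraph. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def induced_subgraph(citations, bibliography):
    """Keep only citation edges whose two endpoints are both in the bibliography."""
    nodes = set(bibliography)
    edges = {
        (src, dst)
        for src, dst in citations
        if src in nodes and dst in nodes and src != dst
    }
    return nodes, edges


def density(nodes, edges):
    """Directed density: edges present divided by the n * (n - 1) possible ones."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def global_clustering(nodes, edges):
    """Closed connected triples divided by all connected triples (undirected view)."""
    neighbours = {v: set() for v in nodes}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    triples = closed = 0
    for v in nodes:
        # Every pair of neighbours of v forms a connected triple centred at v.
        for a, b in combinations(neighbours[v], 2):
            triples += 1
            if b in neighbours[a]:
                closed += 1
    return closed / triples if triples else 0.0


# Toy data (hypothetical): papers E and F are outside the bibliography.
bibliography = ["A", "B", "C", "D"]
citations = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "A"), ("E", "A"), ("A", "F")]

nodes, edges = induced_subgraph(citations, bibliography)
print(density(nodes, edges), global_clustering(nodes, edges))
```

In the toy data, the edges touching E and F are dropped because those papers are not in the bibliography, so only four of the six citations survive in the induced subgraph.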

Paper Structure (12 sections, 5 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: An example of a citation graph where every node is located at a certain time $t$. The example shows how the choice of the $\Delta$ parameter affects the resulting subgraph $G_i(\Delta)$.
  • Figure 2: Precision-Recall curve of the final mixed model, and the model's F1-score on evaluation, visualized over the year of the predicted papers.
  • Figure 3: A comparison of the distributions of all $\phi$-scores and $\theta$-scores for winners and non-winners. The $\phi$-score experiment in particular shows that the scores of the winners lie in a specific range, while the distribution of the non-winners has a similar mean but a greater variance.
  • Figure 4: A comparison of the distributions of the topological features for winners and non-winners, in the form of box plots. The horizontal axis represents the true label $y$.