Benchmark for Evaluation and Analysis of Citation Recommendation Models

Puja Maharjan

TL;DR

The paper tackles the lack of standardized benchmarks for citation-recommendation methods by proposing a benchmark focused on local citation-context analysis. It constructs diagnostic datasets from S2ORC (full text) and S2AG (metadata) and uses stratified sampling across fields, years, and citation counts to enable controlled, cross-domain evaluation of features such as context length, location, type, and surrounding linguistic cues. Evaluations of BM25 and neural models (e.g., NCN, LCR, Galactica) show that BM25 is often robust across tasks, while neural models exhibit context- and domain-dependent strengths, highlighting the benchmark’s utility for fair comparisons. The work provides a standardized platform with data pipelines, preprocessing, and evaluation protocols to guide future development of citation-recommendation methods and support reproducible progress in the field.
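The stratified sampling mentioned above can be illustrated with a minimal sketch. It is not the paper's pipeline: the record keys (`field`, `year`, `citation_count`), the bucket edges in `citation_bucket`, and the `per_stratum` parameter are all hypothetical choices made for illustration. The idea is simply to group papers by (field, year, citation-count bucket) and draw up to a fixed number from each group, so that no single field or year dominates the evaluation set.

```python
import random
from collections import defaultdict


def citation_bucket(count):
    """Map a raw citation count to a coarse bucket (hypothetical edges)."""
    if count < 10:
        return "low"
    if count < 100:
        return "mid"
    return "high"


def stratified_sample(papers, per_stratum, seed=0):
    """Draw up to `per_stratum` papers from each (field, year, bucket) stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for paper in papers:
        key = (paper["field"], paper["year"], citation_bucket(paper["citation_count"]))
        strata[key].append(paper)

    sample = []
    for key in sorted(strata):  # sorted keys give a deterministic order
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata contribute all they have
        sample.extend(rng.sample(group, k))
    return sample


# Toy records standing in for paper metadata (assumed schema, not S2AG's).
papers = [
    {"id": i, "field": f, "year": y, "citation_count": c}
    for i, (f, y, c) in enumerate(
        (f, y, c)
        for f in ("Computer Science", "Medicine")
        for y in (2010, 2020)
        for c in (3, 5, 50, 500)
    )
]

sample = stratified_sample(papers, per_stratum=1)
```

With `per_stratum=1`, each of the twelve strata in the toy data contributes exactly one paper; raising `per_stratum` beyond a stratum's size simply returns every paper in that stratum.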

Abstract

Citation recommendation systems have attracted much academic interest, resulting in many studies and implementations. These systems help authors automatically generate proper citations by suggesting relevant references based on the text they have written. However, the methods used in citation recommendation differ across various studies and implementations. Some approaches focus on the overall content of papers, while others consider the context of the citation text. Additionally, the datasets used in these studies include different aspects of papers, such as metadata, citation context, or even the full text of the paper in various formats and structures. The diversity in models, datasets, and evaluation metrics makes it challenging to assess and compare citation recommendation methods effectively. To address this issue, a standardized dataset and evaluation metrics are needed to evaluate these models consistently. Therefore, we propose developing a benchmark specifically designed to analyze and compare citation recommendation models. This benchmark will evaluate the performance of models on different features of the citation context and provide a comprehensive evaluation of the models across all these tasks, presenting the results in a standardized way. By creating a benchmark with standardized evaluation metrics, researchers and practitioners in the field of citation recommendation will have a common platform to assess and compare different models. This will enable meaningful comparisons and help identify promising approaches for further research and development in the field.

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Citation position in the context sentences.
  • Figure 2: Combined preceding POS of the citation.
  • Figure 3: Data distribution of papers across fields.
  • Figure 4: Data distribution of papers from 2000 to 2023, by field.
  • Figure 5: Citation count distribution by field, where the citation count is normalized by the total number of papers in each field.
  • ...and 3 more figures