Can LLMs Predict Citation Intent? An Experimental Analysis of In-context Learning and Fine-tuning on Open LLMs

Paris Koloveas, Serafeim Chatzopoulos, Thanasis Vergoulis, Christos Tryfonopoulos

TL;DR

This work shows that open, general-purpose LLMs can predict citation intent with strong performance using in-context learning across diverse prompting configurations, without domain-specific pretraining. It systematically evaluates twelve models from five families on two standard datasets, SciCite and ACL-ARC, identifying prompting and configuration patterns that yield high F1-scores. Crucially, supervised fine-tuning with LoRA on the best model yields substantial relative F1-score improvements over the instruction-tuned baseline (approximately $8\%$ on SciCite and $4.3\%$ on ACL-ARC) and brings performance close to specialized systems, while preserving ease of deployment. The authors also release their evaluation framework and fine-tuned weights to support future research in scientometrics and prompt engineering.

Abstract

This work investigates the ability of open Large Language Models (LLMs) to predict citation intent through in-context learning and fine-tuning. Unlike traditional approaches relying on domain-specific pre-trained models like SciBERT, we demonstrate that general-purpose LLMs can be adapted to this task with minimal task-specific data. We evaluate twelve model variations across five prominent open LLM families using zero-, one-, few-, and many-shot prompting. Our experimental study identifies the top-performing model and prompting parameters through extensive in-context learning experiments. We then demonstrate the significant impact of task-specific adaptation by fine-tuning this model, achieving a relative F1-score improvement of 8% on the SciCite dataset and 4.3% on the ACL-ARC dataset compared to the instruction-tuned baseline. These findings provide valuable insights for model selection and prompt engineering. Additionally, we make our end-to-end evaluation framework and models openly available for future use.
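
To make the terms in the abstract concrete, the following is a minimal sketch, not the authors' evaluation framework. It shows how a k-shot prompt could be assembled (k = 0 for zero-shot, k = 1 for one-shot, larger k for few- and many-shot) and how a relative F1-score improvement is computed. The function names, the `Citation:`/`Intent:` prompt layout, the instruction text, and the example sentences and labels are all assumptions made for illustration; the paper's actual system prompts are the ones shown in its Figure 1.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
    k: int,
) -> str:
    """Assemble a k-shot prompt (hypothetical layout, not the paper's).

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and larger
    values give few- or many-shot prompts.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of examples")
    parts = [instruction.strip()]
    # Each demonstration pairs a citing sentence with its intent label.
    for sentence, label in examples[:k]:
        parts.append(f"Citation: {sentence}\nIntent: {label}")
    # The query is left unlabeled so the model completes the intent.
    parts.append(f"Citation: {query}\nIntent:")
    return "\n\n".join(parts)


def relative_improvement(baseline_f1: float, tuned_f1: float) -> float:
    """Relative F1 improvement over a baseline, in percent."""
    if baseline_f1 <= 0:
        raise ValueError("baseline_f1 must be positive")
    return (tuned_f1 - baseline_f1) / baseline_f1 * 100.0


if __name__ == "__main__":
    # Made-up demonstrations and labels, for illustration only.
    demos = [
        ("We follow the parsing approach of [1].", "method"),
        ("Citation analysis has a long history [2].", "background"),
    ]
    print(build_prompt("Classify the citation.", demos, "Our scores exceed [3].", k=1))
    # Made-up scores, not results from the paper.
    print(round(relative_improvement(0.50, 0.54), 2))
```

Note that a relative improvement is measured against the baseline score rather than in absolute F1 points: with the made-up scores above, a gain of 0.04 F1 over a 0.50 baseline is an 8% relative improvement.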

Paper Structure

This paper contains 22 sections, 1 equation, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Our System Prompts.