LLM-Powered Ensemble Learning for Paper Source Tracing: A GPU-Free Approach
Kunlong Chen, Junjun Wang, Zhaoqun Chen, Kunjin Chen, Yitian Chen
TL;DR
This work identifies the key source references of academic papers without GPU-based training, combining zero-shot reasoning from closed-source LLMs with engineered features and ensemble learning using LightGBM and CatBoost. The method uses diverse prompts and group-weighted LLM signals to augment base classifier scores, achieving a MAP of about 0.50 on a validation set and outperforming a SciBERT baseline. The results demonstrate a resource-efficient strategy that blends semantic reasoning from LLMs with structured features, and they suggest multi-LLM inference fusion as a promising direction for citation-source tracing and related tasks.
Abstract
We participated in the KDD CUP 2024 paper source tracing competition and achieved 3rd place. The competition tasked participants with identifying the reference sources (ref-sources, in the organizers' terminology) of given academic papers. Unlike most teams, which addressed this challenge by fine-tuning pre-trained neural language models such as BERT or ChatGLM, our primary approach used closed-source large language models (LLMs). With recent advancements in LLM technology, closed-source LLMs have demonstrated the capability to tackle complex reasoning tasks in zero-shot or few-shot scenarios. Consequently, lacking GPUs, we employed closed-source LLMs to directly generate predicted reference sources from the provided papers, and we further refined these predictions through ensemble learning. Notably, our method was the only one among the award-winning approaches that did not require GPUs for model training. Code is available at https://github.com/Cklwanfifa/KDDCUP2024-PST.
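
To make the idea of refining LLM predictions through ensembling concrete, here is a minimal Python sketch of one possible fusion step together with the MAP metric. It is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt-group names, the weights, and the additive fusion rule are all hypothetical, and labels are assumed to be binary (1 for a ref-source, 0 otherwise).

```python
from typing import Dict, List, Sequence, Tuple


def fuse_scores(
    base_scores: Sequence[float],
    llm_votes: Dict[str, Sequence[int]],
    group_weights: Dict[str, float],
) -> List[float]:
    """Add weighted LLM votes to base classifier scores (one value per reference).

    Assumption: each prompt group emits a 0/1 vote per reference, and fusion is
    a simple weighted sum. The real method may combine signals differently.
    """
    fused = list(base_scores)
    for group, votes in llm_votes.items():
        weight = group_weights.get(group, 0.0)  # unknown groups contribute nothing
        for i, vote in enumerate(votes):
            fused[i] += weight * vote
    return fused


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for the ranked reference list of a single paper."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    papers: Sequence[Tuple[Sequence[int], Sequence[float]]]
) -> float:
    """Mean of per-paper average precision over (labels, scores) pairs."""
    return sum(average_precision(l, s) for l, s in papers) / len(papers)


# Toy example: one paper with three references; only the first is a ref-source.
labels = [1, 0, 0]
base_scores = [0.2, 0.6, 0.4]  # hypothetical LightGBM/CatBoost-style scores
llm_votes = {"prompt_a": [1, 0, 0], "prompt_b": [1, 0, 1]}  # hypothetical votes
group_weights = {"prompt_a": 0.3, "prompt_b": 0.2}  # hypothetical weights

fused = fuse_scores(base_scores, llm_votes, group_weights)
```

In this toy case the base scores rank the true ref-source last, while the fused scores move it to the top, so the average precision for the paper rises. In practice the weights would be tuned on validation data rather than chosen by hand.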
