The Solution for The PST-KDD-2024 OAG-Challenge
Shupeng Zhong, Xinger Li, Shushan Jin, Yang Yang
TL;DR
The paper addresses source-paper identification in the PST-KDD-2024 OAG-Challenge with a dual pipeline that combines BERT-based text classification with GCN-based node classification. It introduces data-cleaning strategies for XML-formatted papers, prompt-based context enrichment, and a graph that links titles, abstracts, and fragments to fuse contextual and structural information. An ensemble then merges the two branches to improve accuracy. Empirically, the approach is competitive, with official submissions scoring around 0.4769 and test evaluations reaching up to 0.4796, which suggests that contextual language models and graph-based relational modeling have complementary strengths. Key contributions are the XML data cleaning, the graph-based integration of multiple paper components, and an ensemble strategy that consolidates semantic and structural cues.
Abstract
In this paper, we introduce the second-place solution in the paper source tracing track of the KDD-2024 OAG-Challenge. Our solution is based on two methods, BERT and GCN, and combines their inference results in the final submission so that the two models complement each other. In the BERT solution, we focus on processing the fragments that appear in the references of the paper and apply a variety of operations to reduce redundant interference in those fragments, so that BERT receives more refined input. In the GCN solution, we map paper fragments, abstracts, and titles into a high-dimensional semantic space with an embedding model, and build edges between titles, abstracts, and fragments to integrate contextual relationships into the prediction. Our solution achieved a score of 0.47691 in the competition.
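As an illustration of how the BERT and GCN inference results might be combined, the sketch below averages per-reference scores from the two branches. The abstract does not state the actual fusion rule, so the weighted average, the function name `ensemble_scores`, the `bert_weight` parameter, and the reference identifiers are all assumptions made for this example.

```python
from typing import Dict


def ensemble_scores(
    bert_scores: Dict[str, float],
    gcn_scores: Dict[str, float],
    bert_weight: float = 0.5,
) -> Dict[str, float]:
    """Merge per-reference scores from two models by weighted averaging.

    Hypothetical helper: the weighting scheme is an assumption, not the
    fusion rule used in the competition solution.
    """
    if not 0.0 <= bert_weight <= 1.0:
        raise ValueError("bert_weight must lie in [0, 1]")

    merged: Dict[str, float] = {}
    # Visit every reference scored by at least one of the two branches.
    for ref_id in bert_scores.keys() | gcn_scores.keys():
        b = bert_scores.get(ref_id)
        g = gcn_scores.get(ref_id)
        if b is None:
            # Only the GCN branch scored this reference.
            merged[ref_id] = g
        elif g is None:
            # Only the BERT branch scored this reference.
            merged[ref_id] = b
        else:
            merged[ref_id] = bert_weight * b + (1.0 - bert_weight) * g
    return merged
```

Under this assumption, a reference that only one branch scored keeps that branch's score unchanged, and `bert_weight` would be tuned on validation data.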
