The Solution for The PST-KDD-2024 OAG-Challenge

Shupeng Zhong, Xinger Li, Shushan Jin, Yang Yang

TL;DR

The paper tackles the problem of identifying source papers in the PST-KDD-2024 OAG-Challenge by proposing a dual pipeline that combines BERT-based text classification with GCN-based node classification. It introduces data-cleaning strategies for XML formats, prompt-based context enrichment, and a graph that links titles, abstracts, and fragments to fuse contextual and structural information, followed by an ensemble that merges the two branches for improved accuracy. Empirically, the approach achieves competitive performance, with official submissions around 0.4769 and test evaluations up to 0.4796, demonstrating the complementary strengths of contextual language models and graph-based relational modeling. Key contributions include robust XML data cleaning, graph-based integration of multiple paper components, and an effective ensemble strategy that consolidates both semantic and structural cues.

Abstract

In this paper, we introduce the second-place solution to the paper source tracing track of the KDD-2024 OAG-Challenge. Our solution is built on two methods, BERT and GCN, and the final submission combines the inference results of both so that they complement each other. In the BERT solution, we focus on processing the fragments that appear in the references of the paper, and apply several operations to reduce redundant noise in the fragments, so that the input BERT receives is cleaner. In the GCN solution, we map paper fragments, abstracts, and titles into a high-dimensional semantic space with an embedding model, and attempt to build edges between titles, abstracts, and fragments so that contextual relationships inform the prediction. Our solution achieved a score of 0.47691 in the competition.
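
To make the two branches concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy graph with one title node, one abstract node, and one fragment node, applies a single round of the symmetric-normalized propagation used by GCN layers (without learned weights or a nonlinearity), and blends two score lists with a weighted average. The blending rule, the weight, the feature vectors, and all function names are hypothetical; the abstract does not specify how the BERT and GCN outputs are combined.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list-of-lists matrix."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop so each node keeps its own features
    for u, v in edges:
        adj[u][v] = 1.0  # undirected edge
        adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    return [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def propagate(norm_adj, features):
    """One GCN-style propagation step: normalized adjacency times features."""
    n, dim = len(features), len(features[0])
    return [
        [sum(norm_adj[i][k] * features[k][d] for k in range(n)) for d in range(dim)]
        for i in range(n)
    ]


def blend_scores(bert_scores, gcn_scores, weight=0.5):
    """Weighted average of per-reference scores (assumed ensemble rule)."""
    return [weight * b + (1.0 - weight) * g for b, g in zip(bert_scores, gcn_scores)]


# Hypothetical toy graph: node 0 = title, node 1 = abstract, node 2 = fragment.
edges = [(0, 1), (0, 2), (1, 2)]
# Hypothetical 2-dimensional embeddings standing in for embedding-model output.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

norm_adj = normalized_adjacency(3, edges)
smoothed = propagate(norm_adj, features)

# Hypothetical per-reference scores from each branch, blended with equal weight.
final = blend_scores([0.8, 0.2], [0.6, 0.4], weight=0.5)
```

In this fully connected toy graph every node ends up with the same smoothed features, which illustrates how propagation mixes title, abstract, and fragment information. A real pipeline would use learned layer weights and a classifier on top of the propagated features.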

Paper Structure (10 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Train Dataset For RoBERTa
  • Figure 2: Our GCN Mapping Method
  • Figure 3: Training a BERT-like model using a large language model-processed dataset