ChatPD: An LLM-driven Paper-Dataset Networking System

Anjie Xu, Ruiqing Ding, Leye Wang

TL;DR

This work tackles dataset discovery and validation in scientific research by automating dataset usage extraction from papers and constructing a scalable paper-dataset network with LLMs. It introduces ChatPD, an online system with three modules (paper collection, dataset information extraction, and dataset entity resolution) and a Graph Completion and Inference strategy that maps dataset descriptions to canonical entities. Empirical results show high information-extraction precision (~0.99) and entity-resolution F1 (~0.88), along with the discovery of many novel datasets not present in PapersWithCode (PwC). The system is deployed on newly published arXiv cs.AI papers, provides dataset discovery services, and is open-sourced on GitHub, highlighting its practical impact on data discovery and reproducibility in AI research.

Abstract

Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network in this GitHub repository: https://github.com/ChatPD-web/ChatPD.
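
To make the idea of a paper-dataset network and the two discovery services named in the abstract more concrete, here is a minimal sketch. It is not ChatPD's implementation: the class `PaperDatasetNetwork`, its method names, the Jaccard-overlap similarity, and the toy usage records (paper ids `paper-A` to `paper-C` and their task labels) are all assumptions made for illustration.

```python
from collections import defaultdict


class PaperDatasetNetwork:
    """Toy bipartite network: papers on one side, dataset entities on the other."""

    def __init__(self):
        self.datasets_by_paper = defaultdict(set)  # paper id -> dataset names
        self.papers_by_dataset = defaultdict(set)  # dataset name -> paper ids
        self.tasks_by_dataset = defaultdict(set)   # dataset name -> task labels

    def add_usage(self, paper_id, dataset, task):
        """Record that a paper uses a (already resolved) dataset entity for a task."""
        self.datasets_by_paper[paper_id].add(dataset)
        self.papers_by_dataset[dataset].add(paper_id)
        self.tasks_by_dataset[dataset].add(task)

    def datasets_for_task(self, task):
        """Task-specific query: matching datasets, most-used first."""
        hits = [d for d, tasks in self.tasks_by_dataset.items() if task in tasks]
        return sorted(hits, key=lambda d: (-len(self.papers_by_dataset[d]), d))

    def similar_datasets(self, dataset, top_k=3):
        """Rank other datasets by Jaccard overlap of the papers that use them.

        Jaccard overlap is an assumed stand-in, not the paper's recommender.
        """
        base = self.papers_by_dataset.get(dataset, set())
        scored = []
        for other, papers in self.papers_by_dataset.items():
            if other == dataset:
                continue
            union = base | papers
            score = len(base & papers) / len(union) if union else 0.0
            if score > 0:
                scored.append((other, score))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:top_k]


# Made-up usage records, for illustration only.
network = PaperDatasetNetwork()
network.add_usage("paper-A", "MNIST", "image classification")
network.add_usage("paper-A", "Fashion-MNIST", "image classification")
network.add_usage("paper-B", "MNIST", "image classification")
network.add_usage("paper-C", "Natural Questions", "question answering")

print(network.datasets_for_task("image classification"))
print(network.similar_datasets("MNIST"))
```

Keeping both directions of the paper-dataset edges indexed makes either kind of lookup a dictionary access. A deployed system would presumably need persistent storage and would rely on entity resolution to merge dataset name variants before edges are added.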

Paper Structure

This paper contains 34 sections, 2 equations, 5 figures, 7 tables, and 2 algorithms.

Figures (5)

  • Figure 1: System Architecture of ChatPD.
  • Figure 2: Dataset Information Extraction Prompt.
  • Figure 3: Performance of Dataset Information Extraction.
  • Figure 4: Coverage of Papers with Extracted Dataset Information in arXiv cs.AI Category.
  • Figure 5: Visualization of Paper-Dataset Network (F-MNIST means Fashion-MNIST, NQ means Natural Questions).