Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution

Zixin Wei, Yucan Guo, Jinyang Li, Xiaolin Han, Xiaolong Jin, Chenhao Ma

TL;DR

The paper tackles task-oriented dataset search by introducing KATS, a two-stage system that builds a task-dataset knowledge graph offline and, at query time, combines vector search with graph-based ranking to surface relevant datasets from unstructured scientific literature. It also introduces CS-TDS, a standardized benchmark for evaluating task-oriented dataset search, and reports that KATS outperforms state-of-the-art retrieval frameworks in both effectiveness and efficiency. Core innovations include a collaborative multi-agent information extraction pipeline, an entity-resolution mechanism that addresses dataset naming ambiguities, and KG-based task expansion using Personalized PageRank. The authors present KATS as a scalable blueprint for dataset discovery that supports incremental updates, backed by case studies and ablation experiments.

Abstract

The search for suitable datasets is the critical "first step" in data-driven research, but it remains a great challenge. Researchers often need to search for datasets based on high-level task descriptions. However, existing search systems struggle with this task due to ambiguous user intent, task-to-dataset mapping and benchmark gaps, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search from unstructured scientific literature. KATS consists of two key components, i.e., offline knowledge base construction and online query processing. The sophisticated offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph by employing a collaborative multi-agent framework for information extraction, thereby filling the task-to-dataset mapping gap. To further address the challenge of entity ambiguity, a unique semantic-based mechanism is used for task entity linking and dataset entity resolution. For online retrieval, KATS utilizes a specialized hybrid query engine that combines vector search with graph-based ranking to generate highly relevant results. Additionally, we introduce CS-TDS, a tailored benchmark suite for evaluating task-oriented dataset search systems, addressing the critical gap in standardized evaluation. Experiments on our benchmark suite show that KATS significantly outperforms state-of-the-art retrieval-augmented generation frameworks in both effectiveness and efficiency, providing a robust blueprint for the next generation of dataset discovery systems.
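
The hybrid retrieval idea described above (vector search to find tasks close to the query, then graph-based ranking over a task-dataset knowledge graph) can be illustrated with a minimal sketch. This is not the KATS implementation: the function names, the toy graph, the two-dimensional embeddings, and the parameters `alpha`, `top_tasks`, and `k` are all assumptions made for illustration, and Personalized PageRank is computed here by plain power iteration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """Power iteration; `seeds` maps node -> restart weight (weights sum to 1)."""
    rank = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iters):
        # Restart mass goes back to the seed nodes on every step.
        nxt = {n: (1 - alpha) * seeds.get(n, 0.0) for n in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                # Dangling node: hand its mass back to the seeds.
                for s, w in seeds.items():
                    nxt[s] += alpha * rank[node] * w
                continue
            share = alpha * rank[node] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        rank = nxt
    return rank

def search(query_vec, task_vecs, graph, datasets, top_tasks=2, k=5):
    """Vector stage picks seed tasks; graph stage ranks datasets by PPR score."""
    sims = {t: max(cosine(query_vec, v), 0.0) for t, v in task_vecs.items()}
    best = sorted(sims, key=sims.get, reverse=True)[:top_tasks]
    total = sum(sims[t] for t in best)
    if total == 0:
        return []
    seeds = {t: sims[t] / total for t in best}
    rank = personalized_pagerank(graph, seeds)
    ranked = sorted(datasets, key=lambda d: rank[d], reverse=True)
    return [d for d in ranked if rank[d] > 0][:k]

# Toy task-dataset graph (hypothetical, undirected adjacency lists).
GRAPH = {
    "question answering": ["SQuAD", "reading comprehension"],
    "reading comprehension": ["SQuAD", "question answering"],
    "image classification": ["ImageNet", "CIFAR-10"],
    "SQuAD": ["question answering", "reading comprehension"],
    "ImageNet": ["image classification"],
    "CIFAR-10": ["image classification"],
}
DATASETS = ["SQuAD", "ImageNet", "CIFAR-10"]
# Hypothetical 2-d task embeddings; a real system would use a text encoder.
TASK_VECS = {
    "question answering": [1.0, 0.1],
    "reading comprehension": [0.9, 0.2],
    "image classification": [0.0, 1.0],
}

results = search([1.0, 0.0], TASK_VECS, GRAPH, DATASETS)
print(results)
```

In this toy setup the query vector is closest to the two text-understanding tasks, so all restart mass stays in their connected component and the image datasets receive no score. Seeding the walk with several similar tasks, rather than only the single best match, is one simple way to model the task expansion the summary mentions.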

Paper Structure

This paper contains 59 sections, 2 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Overall architecture of the KATS system.
  • Figure 2: Ontology and an example of the task-dataset KG.
  • Figure 3: The query processing pipeline of the KATS system.
  • Figure 4: End-to-end effectiveness comparison in terms of Hit Rate@5 on $\text{CS-TDS}_{M}$ and $\text{CS-TDS}_{L}$.
  • Figure 5: Effectiveness vs. efficiency trade-offs in KATS.
  • ...and 3 more figures
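
Figure 4 reports effectiveness as Hit Rate@5. As a rough illustration, the sketch below uses the common definition of Hit Rate@k: the fraction of queries for which at least one relevant item appears in the top k results. The paper's exact formulation is not given in this summary and may differ, and the function name and sample data are hypothetical.

```python
def hit_rate_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant item in the top k."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)

# Two hypothetical queries: the first has a relevant dataset in its top 5,
# the second does not.
ranked_lists = [
    ["SQuAD", "ImageNet", "CIFAR-10"],
    ["ImageNet", "CIFAR-10"],
]
relevant_sets = [{"SQuAD"}, {"MNIST"}]
score = hit_rate_at_k(ranked_lists, relevant_sets, k=5)
print(score)
```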