Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System

Necva Bölücü, Jessica Irons, Changhyun Lee, Brian Jin, Maciej Rybinski, Huichen Yang, Andreas Duenser, Stephen Wan

Abstract

The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.

Paper Structure (46 sections, 9 figures, 12 tables, 1 algorithm)

Figures (9)

  • Figure 1: SciLire components and AI-augmented curation workflow.
  • Figure 2: Average time spent validating data from the first 20 papers across the early adopter use cases.
  • Figure 3: The Table & Figure Extraction module.
  • Figure 4: Interaction flows within SciLire for the early adopter trials. Results show that most data acceptances (locking_data) and rejections (setting_irrelevant) occur via a data verification step (checking either the provenance data or the original source PDF). "Vetting popup" refers to the verification support tools; "Updating_value" refers to human editing and manual data curation activities. Actions ending in "1" are used to eliminate cycles so that the flows can be visualised as a Sankey diagram.
  • Figure 5: A screenshot of SciLire for project creation.
  • ...and 4 more figures