SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model

Xingbo Wang, Samantha L. Huey, Rui Sheng, Saurabh Mehta, Fei Wang

TL;DR

SciDaSynth tackles the challenge of efficiently extracting structured data from multimodal scientific literature by integrating retrieval-augmented generation with interactive validation and semantic grouping. The system automatically generates structured data tables from text, tables, and figures in response to user queries, and offers visual summaries and grouping to resolve cross-document inconsistencies. In a within-subjects study across the nutrition and NLP domains, SciDaSynth delivered higher data quality than both a human baseline and an automated baseline, and significantly faster data extraction than the human baseline. Users reported streamlined workflows and effective validation tools, yet remained cautious about AI outputs. The work highlights design considerations for human-AI collaboration in data extraction and offers a public codebase for replication.

Abstract

The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks. The system code is available at https://github.com/xingbow/SciDaEx

Paper Structure (59 sections, 5 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: System workflow of SciDaSynth: (1) Retrieval augmented generation (RAG) based technical framework for extracting and structuring data from figures, text, and tables in scientific documents using LLMs. (2) The user interface then allows for data extraction via question-answering, data validation, correction, summarization, standardization, and database updates through an iterative refinement process.
  • Figure 2: User interface of SciDaSynth. The interface features: (A) A query panel for users to input natural language questions or select specific data attributes; (B) A data table displaying extracted information with highlighting of potentially problematic records; (C) Context menu options to validate data by examining relevant document snippets; (D) PDF viewer for accessing original sources; (E) Data standardization panel with multi-level and multi-faceted data summarization and standardization support.
  • Figure 3: Group standardization process. Users start with the statistics of the major groups within a selected data attribute. Then, they can edit individual groups by changing the group labels and removing irrelevant values. Finally, they can apply the edited group results to the data table.
  • Figure 4: Data quality achieved with SciDaSynth, Baseline A (human), and Baseline B (automated method). SciDaSynth achieved the highest data quality scores for both Dataset I and Dataset II. *: p < 0.05, **: p < 0.01.
  • Figure 5: Task completion time with SciDaSynth and Baseline A. The pairwise comparison was significant. ***: p < 0.001.
  • ...and 1 more figure
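
The group standardization process described in the Figure 3 caption (inspect group statistics, edit groups, apply the result to the table) can be illustrated with a minimal sketch. This is not the SciDaSynth implementation: the table rows, the "crop" attribute, the initial grouping, and all function names below are hypothetical, and the grouping is written by hand rather than produced by the system's semantic grouping.

```python
from collections import Counter

# Hypothetical extracted table: one row per paper, with an inconsistent "crop" column.
rows = [
    {"paper": "A", "crop": "maize"},
    {"paper": "B", "crop": "Maize (corn)"},
    {"paper": "C", "crop": "corn"},
    {"paper": "D", "crop": "wheat"},
    {"paper": "E", "crop": "sweet potato"},
]

# Assumed initial grouping (raw value -> group label), written by hand here.
groups = {
    "maize": "corn-like",
    "Maize (corn)": "corn-like",
    "corn": "corn-like",
    "sweet potato": "corn-like",  # a value that does not belong in this group
    "wheat": "wheat",
}


def group_stats(rows, attribute, groups):
    """Step 1: count how many rows fall into each group for one attribute."""
    return Counter(groups.get(r[attribute], r[attribute]) for r in rows)


def rename_group(groups, old_label, new_label):
    """Step 2a: change a group's label, leaving its members unchanged."""
    return {v: (new_label if g == old_label else g) for v, g in groups.items()}


def remove_value(groups, value):
    """Step 2b: drop an irrelevant value from the grouping."""
    return {v: g for v, g in groups.items() if v != value}


def apply_groups(rows, attribute, groups):
    """Step 3: write group labels back into a copy of the table.

    Values that are not in the grouping keep their original text.
    """
    return [{**r, attribute: groups.get(r[attribute], r[attribute])} for r in rows]


stats = group_stats(rows, "crop", groups)
edited = remove_value(rename_group(groups, "corn-like", "maize"), "sweet potato")
standardized = apply_groups(rows, "crop", edited)

for row in standardized:
    print(row["paper"], row["crop"])
```

Each helper returns a new object instead of mutating its input, so the original table stays available for comparison or for undoing an edit.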