Leveraging Large Language Models for Automated Scalable Development of Open Scientific Databases

Nikita Gautam; Doina Caragea; Ignacio Ciampitti; Federico Gomez

TL;DR

This paper introduces a web-based tool that uses Large Language Models (LLMs) to automate the scalable development of open scientific databases. The tool combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases.

Abstract

With the exponential increase in online scientific literature, identifying reliable domain-specific data has become increasingly important but also very challenging. Manually collecting and filtering domain-specific scientific literature is time-consuming, labor-intensive, and prone to errors and inconsistencies. To facilitate automated data collection, the paper introduces a web-based tool that leverages Large Language Models (LLMs) for automated and scalable development of open scientific databases. More specifically, the tool is based on a unified, automated framework that combines keyword-based querying, API-enabled data retrieval, and LLM-powered text classification to construct domain-specific scientific databases. Data is collected from multiple reliable data sources and search engines using a parallel querying technique, and the results are merged into a single unified dataset. The dataset is then filtered using LLMs, queried with prompts tailored to each keyword-based query, to extract the data relevant to a scientific query of interest. The approach was tested on a range of keyword-based searches for different domain-specific tasks related to agriculture and crop yield. The results show a 90% overlap with small databases curated by domain experts, suggesting that the proposed tool can significantly reduce manual workload. Furthermore, the proposed framework is both scalable and domain-agnostic, and can be applied across diverse fields to build open scientific databases.
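The collect-then-filter flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields ("title", "abstract", "source"), the prompt wording, the title-based de-duplication, and all function names are assumptions made for illustration. The data sources and the classifier are passed in as plain callables, so a real API client or LLM call could be substituted for the stubs shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]                 # assumed schema: "title", "abstract", "source"
Source = Callable[[str], List[Record]]  # keyword query -> records from one source
Classifier = Callable[[str], str]       # prompt -> raw text reply (e.g. from an LLM)


def collect(query: str, sources: Iterable[Source]) -> List[Record]:
    """Query all sources in parallel and merge the results into one list."""
    sources = list(sources)
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map keeps the order of `sources`, so merging is deterministic.
        batches = list(pool.map(lambda fetch: fetch(query), sources))
    merged, seen = [], set()
    for batch in batches:
        for rec in batch:
            # Assumption: records with the same normalized title are duplicates.
            key = " ".join(rec["title"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged


def build_prompt(query: str, abstract: str) -> str:
    """Build a query-specific zero-shot prompt (hypothetical wording)."""
    return (
        f"Topic of interest: {query}\n"
        f"Abstract: {abstract}\n"
        "Is this abstract relevant to the topic? "
        "Answer 'relevant' or 'not relevant'."
    )


def filter_relevant(query: str, records: List[Record],
                    classify: Classifier) -> List[Record]:
    """Keep only the records whose abstract the classifier labels relevant."""
    kept = []
    for rec in records:
        reply = classify(build_prompt(query, rec["abstract"])).strip().lower()
        if reply.startswith("relevant"):
            kept.append(rec)
    return kept


# Stub sources and a keyword stub standing in for real APIs and a real LLM.
def source_a(query: str) -> List[Record]:
    return [{"title": "Maize yield under drought",
             "abstract": "We measure maize yield across sites.", "source": "A"}]


def source_b(query: str) -> List[Record]:
    return [{"title": "maize  yield under drought",
             "abstract": "We measure maize yield across sites.", "source": "B"},
            {"title": "Image recognition benchmarks",
             "abstract": "We classify photographs of cats.", "source": "B"}]


def keyword_stub(prompt: str) -> str:
    abstract_line = prompt.split("\n")[1]
    return "relevant" if "yield" in abstract_line.lower() else "not relevant"


records = collect("maize yield", [source_a, source_b])
relevant = filter_relevant("maize yield", records, keyword_stub)
```

Passing the classifier as a callable keeps the filtering step independent of any particular LLM provider, which matches the domain-agnostic framing of the abstract; only the prompt builder needs to change per query.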

Paper Structure (12 sections, 1 equation, 4 figures, 3 tables)

Introduction
Background
Challenges in Assembling Agricultural Scientific Data
NLP and LLMs for Automated Literature Screening
Methods
Data Collection
Data Filtering
Data Classification Using LLMs
Results
Zero-Shot Classification using LLMs
Abstract Filtering Tool
Conclusion and Future Work

Figures (4)

Figure 1: End-to-end pipeline for assembling a domain-specific dataset from web-based literature sources. The process begins with a set of domain-relevant keywords, which are used to query web data sources or search engines. A data scraper engine collects the retrieved documents and aggregates abstracts and metadata from multiple sources into a combined dataset. This raw dataset is passed through a data filtering engine, which uses large language models (LLMs) to classify content as relevant data or noisy data. The pipeline automates and scales the screening process, enabling efficient construction of clean, task-specific datasets.
Figure 2: Prompt used for zero-shot classification with LLMs
Figure 3: Dashboard of Abstract Filtering Tool
Figure 4: Data Collection Page of Abstract Filtering Tool
