ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

Tong Xie, Hanzhi Zhang, Shaozhou Wang, Yuwei Wan, Imran Razzak, Chunyu Kit, Wenjie Zhang, Bram Hoex

TL;DR

ByteScience tackles the challenge of converting unstructured scientific literature into structured data by fine-tuning a domain-specific LLM (DARWIN) on AWS. It introduces a two-phase Green/Blue pipeline enabling rapid corpus construction and targeted model adaptation with minimal annotations and zero-code tooling. Empirical results show strong extraction performance across NER, RE, and ER tasks and substantial reductions in annotation time. The approach promises scalable, cross-disciplinary scientific data extraction and accelerated discovery.

Abstract

Natural Language Processing (NLP) is widely used to condense long text into structured information. However, extracting structured knowledge from scientific text with NLP models remains challenging because of the text's domain-specific nature, the complex data preprocessing it requires, and the granularity of multi-layered, device-level information. To address this, we introduce ByteScience, a non-profit, cloud-based, auto fine-tuned Large Language Model (LLM) platform designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform builds on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. It runs on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small number of well-annotated articles. This tool streamlines the transition from scientific literature to structured knowledge and data, and supports advances in natural informatics.

Paper Structure

This paper contains 13 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: ByteScience Pipeline. The initial setup for a specific field involves constructing a domain-specific corpus of structured scientific data (Green Pipeline) and fine-tuning an LLM on this dataset to optimize performance for the target scientific domain (Blue Pipeline). Once this setup is complete, users can efficiently generate structured datasets from new scientific documents in the same field by utilizing the fine-tuned LLM stored in AWS.
  • Figure 2: Architecture of ByteScience, which builds a structured database on the AWS cloud using an LLM.
  • Figure 3: Screenshot of label setup.
  • Figure 4: Screenshot of labeling page.
  • Figure 5: Screenshot of extraction results of a paper.