
DARWIN Series: Domain Specific Large Language Models for Natural Science

Tong Xie, Yuwei Wan, Wei Huang, Zhenyu Yin, Yixuan Liu, Shaozhou Wang, Qingyuan Linghu, Chunyu Kit, Clara Grazian, Wenjie Zhang, Imran Razzak, Bram Hoex

TL;DR

DARWIN presents a suite of domain-specific, open-source LLMs tailored for natural science by injecting structured and unstructured scientific knowledge through SIG-based instruction generation and multi-task training. The approach leverages SciQ and FAIR datasets to train three models—SIG, DARWIN-BASE, and DARWIN-MDP—achieving state-of-the-art results on scientific QA and competitive performance on materials and device prediction tasks while reducing reliance on closed-source systems. Key contributions include automatic instruction generation from scientific texts, a reproducible open-data pipeline, and demonstrated gains from multi-task learning across diverse scientific tasks. The work highlights the potential of open, knowledge-infused LLMs to accelerate scientific discovery within chemistry, physics, and materials science, while outlining data and evaluation limitations and directions for scaling and broader validation.

Abstract

Emerging tools bring forth fresh approaches to work, and the field of natural science is no different. In natural science, traditional manual, serial, and labour-intensive work is being augmented by automated, parallel, and iterative processes driven by artificial intelligence-based experimental automation and more. To add new capabilities in natural science and to accelerate and enrich the automation of the discovery process, we present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and materials science. This series builds on open-source LLMs, incorporating structured and unstructured scientific knowledge from public datasets and literature. We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness. During fine-tuning, we introduce the Scientific Instruction Generation (SIG) model, which automates instruction generation from scientific texts. This eliminates the need for manual extraction or domain-specific knowledge graphs and efficiently injects scientific knowledge into the model. We also explore multi-task training strategies, revealing interconnections between scientific tasks. The DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models. Our research showcases the ability of LLMs in the scientific domain, with the overarching goal of fostering prosperity within the broader AI for science community.

Paper Structure (28 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: DARWIN vs GPT-4 comparative analysis in natural science tasks
  • Figure 2: Composition of scientific paper dataset
  • Figure 3: Composition of FAIR dataset
  • Figure 5: Darwin-SIG structure and comparison
  • Figure 6: ESOL task instruction generation structure. A solid line delineates a one-to-one correspondence
  • ...and 3 more figures