OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery

Vignesh Prabhakar, Md Amirul Islam, Adam Atanas, Yao-Ting Wang, Joah Han, Aastha Jhunjhunwala, Rucha Apte, Robert Clark, Kang Xu, Zihan Wang, Kai Liu

TL;DR

OmniScience introduces a compute-efficient, domain-adaptive LLM for scientific reasoning by combining domain adaptive pretraining on a broad scientific corpus with supervised fine-tuning and a high-quality reasoning distillation stage. Trained on a LLaMA-3.1 $70$B foundation, it attains state-of-the-art performance among similarly sized models on GPQA Diamond ($\approx 0.72$) and demonstrates strong domain capabilities in battery science, including a dual-agent RAG framework for molecular screening. Ablation results show that both domain adaptation and reasoning-focused distillation are essential to achieve peak performance. The work also presents a practical battery agent that ranks solvent molecules with superior accuracy, illustrating the model’s potential to accelerate domain-specific discovery at reduced computational cost.

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential in advancing scientific knowledge and addressing complex challenges. In this work, we introduce OmniScience, a specialized large reasoning model for general science, developed through three key components: (1) domain adaptive pretraining on a carefully curated corpus of scientific literature, (2) instruction tuning on a specialized dataset to guide the model in following domain-specific tasks, and (3) reasoning-based knowledge distillation through fine-tuning to significantly enhance its ability to generate contextually relevant and logically sound responses. We demonstrate the versatility of OmniScience by developing a battery agent that efficiently ranks molecules as potential electrolyte solvents or additives. Comprehensive evaluations reveal that OmniScience is competitive with state-of-the-art large reasoning models on the GPQA Diamond and domain-specific battery benchmarks, while outperforming all public reasoning and non-reasoning models with similar parameter counts. We further demonstrate via ablation experiments that domain adaptive pretraining and reasoning-based knowledge distillation are both critical to attaining these performance levels across benchmarks.

Paper Structure

This paper contains 17 sections, 42 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Illustration of our OmniScience training pipeline. We begin with a LLaMA 3.1 70B foundation model, apply domain adaptive pretraining to obtain the OmniScience base model, and then perform model alignment and reasoning-based knowledge distillation to produce the final OmniScience Reasoning model.
  • Figure 1: Performance comparison between different SOTA reasoning models on the GPQA Diamond benchmark.
  • Figure 2: Comparison of GPQA Diamond scores with top 10-100B parameter models. Our model outperforms all the baselines including DeepSeek-R1 distill variants.
  • Figure 3: Bar chart visualizing the performance of various LLMs on MMLU, Winogrande, Hellaswag, and ARC-E. OmniScience Reasoning consistently matches or exceeds stronger proprietary models across benchmarks.
  • Figure 4: Bar chart visualization of battery-specific task performance for various LLMs. This figure corresponds to Table \ref{tab:domain} and highlights comparative accuracy across Q/A, MCQ, reading comprehension, summarization, and reasoning.
  • ...and 9 more figures