Innovator: Scientific Continued Pretraining with Fine-grained MoE Upcycling

Ning Liao, Xiaoxing Wang, Zehao Lin, Weiyang Guo, Feng Hong, Shixiang Song, Geng Yu, Zihua Zhao, Sitao Xie, Longxuan Wei, Xiangqi Jin, Xiaohan Qin, Jiale Ma, Kai Chen, Jiangchao Yao, Zhouhan Lin, Junchi Yan, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Linfeng Zhang

TL;DR

Innovator addresses catastrophic forgetting during scientific continued pretraining by upcycling a dense LLM into a fine-grained Mixture-of-Experts (MoE) model with a shared general expert. A four-stage upcycle training paradigm (Scientific Expert Induction, Scientific Expert Split, Science-Aware Routing, Generalist-Scientist Integration), combined with tri-level data quality control and specialized data pipelines, lets the model acquire discipline-specific knowledge while preserving general capability. Post-training with GRPO yields Innovator-Reason, which substantially improves scientific reasoning. Empirical results show around 25% average improvement across 30 scientific tasks and 99% retention of general performance, with Innovator-Reason delivering further reasoning gains on complex problems.

Abstract

A large language model (LLM) with knowledge of both scientific and general tasks is the foundation of scientific general intelligence. However, directly continuing to pretrain an LLM on science data usually leads to catastrophic forgetting, that is, severe degradation of general ability. In this report, we present Innovator, which solves this problem by upcycling a pre-trained dense LLM into a fine-grained Mixture-of-Experts model during continued pretraining, where different experts are expected to learn science knowledge in different disciplines and a shared expert is used for general tasks. Innovator introduces a four-stage upcycle training paradigm: (1) Scientific Expert Induction on discipline-specific data, (2) Fine-grained Expert Splitting via FFN dimension decomposition, (3) Science-Aware Routing warmup, and (4) Generalist-Scientist Integration training on hybrid datasets. This paradigm decouples knowledge in the general domain from knowledge in the different scientific disciplines, avoiding negative interference among domains. With 53.3B total parameters and 13.3B activated, Innovator extends Qwen2.5-7B using a shared general expert and 64 specialized scientific experts, of which 8 are activated. Trained on 300B tokens of tri-level quality-controlled data, Innovator achieves a 25% average improvement across 30 scientific tasks with a win rate of 70%, while retaining 99% of its performance on general tasks. Furthermore, Innovator-Reason, which is post-trained from Innovator to strengthen reasoning, exhibits excellent reasoning performance on complex scientific problems, with improvements of over 30%.
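To make "FFN dimension decomposition" and the "shared expert plus top-k scientific experts" layout more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a SwiGLU-style FFN (as used in Qwen-family models), splits it into equal contiguous slices along the intermediate dimension, and combines a shared expert with the top-k routed slices using a softmax over the selected router logits. The function names, the tiny dimensions (4 experts, 2 activated, instead of 64 and 8), the random weights, and the gating rule are all illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(x, W_gate, W_up, W_down):
    # SwiGLU-style FFN: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    # W_gate, W_up have d_ff rows of length d_model; W_down has d_model rows of length d_ff.
    h = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, h)


def split_ffn(W_gate, W_up, W_down, num_experts):
    # Slice the intermediate dimension into equal contiguous chunks (assumed scheme).
    d_ff = len(W_gate)
    assert d_ff % num_experts == 0
    width = d_ff // num_experts
    experts = []
    for e in range(num_experts):
        lo, hi = e * width, (e + 1) * width
        experts.append((W_gate[lo:hi], W_up[lo:hi], [row[lo:hi] for row in W_down]))
    return experts


def moe_forward(x, shared, experts, W_router, k):
    # Shared expert always runs; the k highest-scoring experts are added on top.
    logits = matvec(W_router, x)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    m = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - m) for i in chosen}
    z = sum(exps.values())
    out = ffn(x, *shared)
    for i in chosen:
        y = ffn(x, *experts[i])
        out = [o + (exps[i] / z) * yi for o, yi in zip(out, y)]
    return out, chosen


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_ff, num_experts, k = 4, 8, 4, 2  # toy sizes, not the paper's

# A dense "scientific" FFN that gets split into fine-grained experts.
W_gate, W_up, W_down = rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model), rand_matrix(d_model, d_ff)
experts = split_ffn(W_gate, W_up, W_down, num_experts)

x = [random.uniform(-1.0, 1.0) for _ in range(d_model)]
dense_out = ffn(x, W_gate, W_up, W_down)
summed_out = [sum(vals) for vals in zip(*(ffn(x, *e) for e in experts))]
max_err = max(abs(a - b) for a, b in zip(dense_out, summed_out))
print("max |dense - sum of slices| =", max_err)

# Placeholder shared expert and router with random weights.
shared = (rand_matrix(d_ff, d_model), rand_matrix(d_ff, d_model), rand_matrix(d_model, d_ff))
W_router = rand_matrix(num_experts, d_model)
moe_out, chosen = moe_forward(x, shared, experts, W_router, k)
print("activated experts:", chosen)
```

The first check illustrates why slicing along the intermediate dimension is a natural way to initialize fine-grained experts: because the activation acts elementwise, the slice outputs sum exactly (up to floating-point rounding) to the original dense FFN output. Once routing activates only k of the slices and reweights them, the output no longer matches the dense FFN, which is presumably why the paradigm follows the split with a routing warmup stage.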

Paper Structure

This paper contains 24 sections, 7 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The overview of the proposed framework for data preprocessing. It covers distinct procedures tailored for general pre-training, scientific pre-training, and post-training phases.
  • Figure 2: The framework of Innovator. It is upcycle-trained from the Qwen2.5-7B dense model via the proposed four-stage training paradigm, comprising Scientific Expert Induction, Scientific Expert Split, Science-Aware Routing, and Generalist-Scientist Integration.
  • Figure 3: The general and scientific performance of Innovator as the training data scales up.