Intern-S1: A Scientific Multimodal Foundation Model

Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqing Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qitan Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Jiaqi Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Yuhang Zang, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou

TL;DR

Intern-S1 addresses the gap between open-source and closed-source models in high-value scientific domains by building a large multimodal Mixture-of-Experts foundation model trained on extensive scientific data and guided by a Mixture-of-Rewards reinforcement learning framework. The architecture combines a MoE LLM with modality-specific encoders (vision, time-series) and a dynamic tokenizer, enabling efficient processing of images, text, and scientific data. The authors report top-tier performance among open-source models on general reasoning benchmarks and clearly stronger scientific reasoning than other open-source models across both text-only and multimodal tasks, with data-efficient RL reducing training cost. They release model weights and tooling to catalyze future research.

Abstract

In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly behind that in popular areas. This is far from sufficient for transforming scientific research and leaves a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and take a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and with the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and prediction of thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.

Paper Structure

This paper contains 70 sections, 8 equations, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Performance comparison among open-source and closed-source models on image-text and text-only benchmarks. Results demonstrate that Intern-S1 has top-tier general reasoning capability among open-source models and outperforms closed-source models in scientific domains. General benchmarks: MMLU-Pro (text-only), GPQA (text-only), AIME2025 (text-only), MMMU, MMStar. Science benchmarks: SmolInstruct (text-only), ChemBench (text-only), MatBench (text-only), SFE, Physics.
  • Figure 2: Performance trend of LLMs across popular and low-resource (science) tasks. The X-axis is the average of three popular general benchmarks: MMLU-Pro, GPQA, and AIME2025. The Y-axis is the average of three science-domain benchmarks: SmolInstruct, ChemBench, and MatBench. Although top-tier open-source LLMs have rapidly raised their performance on popular tasks, their performance on science tasks has not increased.
  • Figure 3: Architecture of Intern-S1, consisting of a MoE LLM with a vision encoder, a time-series encoder, and a dynamic tokenizer that switches the tokenization and embedding strategies for natural language and scientific inputs. Intern-S1 is equipped with InternViT-6B, while Intern-S1-mini uses InternViT-300M for efficiency.
  • Figure 4: Left: The workflow of the dynamic tokenizer. The tokenizer first detects patterns in the input string using a rule-based detector or user-annotated special tags. It then segments the input string into parts. Each part is tokenized with a different strategy, and the embedding spaces of the parts are orthogonal to one another. Finally, the resulting vectors are concatenated into a regular transformer input. Right: The compression ratio of different tokenizers on scientific data (SMILES format). Intern-S1 outperforms the others by over 70%, meaning that it represents the scientific data with far fewer tokens and so reduces computation overhead.
  • Figure 5: Intern-S1 is trained in four stages, and only the first stage trains on a single modality.
  • ...and 9 more figures
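
The dynamic-tokenizer workflow described in the Figure 4 caption (detect, segment, tokenize each part with its own strategy, concatenate) can be illustrated with a toy sketch. This is not the authors' implementation. The `<SMILES>` tag name, the regex-based toy tokenizers, and the choice of characters per token as the compression measure are all assumptions made for illustration. Separate embedding spaces are only approximated here by labelling each token with the kind of segment it came from, so that each kind could index its own embedding table.

```python
import re
from dataclasses import dataclass

# Hypothetical tag pair: the source only mentions "user-annotated special tags".
SMILES_TAG = re.compile(r"<SMILES>(.*?)</SMILES>", re.DOTALL)

# Toy SMILES pattern: two-letter halogens, bracket atoms, else single characters.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|.")


@dataclass
class Segment:
    kind: str  # "text" or "smiles"
    text: str


def segment(s: str) -> list[Segment]:
    """Split the input into plain-text parts and tagged SMILES parts."""
    parts: list[Segment] = []
    pos = 0
    for m in SMILES_TAG.finditer(s):
        if m.start() > pos:
            parts.append(Segment("text", s[pos:m.start()]))
        parts.append(Segment("smiles", m.group(1)))
        pos = m.end()
    if pos < len(s):
        parts.append(Segment("text", s[pos:]))
    return parts


def tokenize_text(text: str) -> list[str]:
    """Toy natural-language tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_smiles(text: str) -> list[str]:
    """Toy SMILES tokenizer that keeps multi-character atoms together."""
    return SMILES_TOKEN.findall(text)


def dynamic_tokenize(s: str) -> list[tuple[str, str]]:
    """Tokenize each segment with its own strategy, then concatenate.

    Each token is returned as (kind, token); the kind stands in for the
    separate vocabulary and embedding table of that segment type.
    """
    out: list[tuple[str, str]] = []
    for seg in segment(s):
        if seg.kind == "smiles":
            toks = tokenize_smiles(seg.text)
        else:
            toks = tokenize_text(seg.text)
        out.extend((seg.kind, t) for t in toks)
    return out


def chars_per_token(text: str, tokens: list[str]) -> float:
    """Assumed compression measure: average characters covered per token."""
    return len(text) / len(tokens)
```

For example, `dynamic_tokenize("mol <SMILES>CCl</SMILES> ok")` routes `CCl` through the SMILES tokenizer and the surrounding words through the text tokenizer. A higher `chars_per_token` value means the same string is represented with fewer tokens, which is the sense in which the caption speaks of saved computation.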