KALE-LM-Chem: Vision and Practice Toward an AI Brain for Chemistry
Weichen Dai, Yezeng Chen, Zijie Dai, Yubo Liu, Zhijie Huang, Yixuan Pan, Baiyang Song, Chengli Zhong, Xinhe Li, Zeyu Wang, Zhuoying Feng, Yi Zhou
TL;DR
The paper argues for a chemistry-focused AI brain built on four core capabilities: information extraction, semantic parsing, knowledge-based QA, and reasoning & planning. It introduces KALE-LM-Chem, a knowledge- and logic-enhanced large language model trained with a two-phase pipeline (domain-specific continual pretraining and supervised fine-tuning) on a diversified chemistry corpus. On ChemBench and metal-organic framework (MOF) information extraction, KALE-LM-Chem-1.5 achieves state-of-the-art or otherwise strong performance, outperforming several baselines, including GPT-3.5 and GPT-4o-mini, on key tasks. By combining domain knowledge and logical reasoning with large language models, the work demonstrates a pathway toward a practical AI brain for accelerating chemical discovery and automation; future work aims to strengthen logic augmentation.
Abstract
Recent advances in large language models (LLMs) have demonstrated strong potential for enabling domain-specific intelligence. In this work, we present our vision for building an AI-powered chemical brain, which frames chemical intelligence around four core capabilities: information extraction, semantic parsing, knowledge-based QA, and reasoning & planning. We argue that domain knowledge and logic are essential pillars for enabling such a system to assist and accelerate scientific discovery. As a first step, we introduce our first generation of large language models for chemistry, KALE-LM-Chem and KALE-LM-Chem-1.5, which achieve outstanding performance on chemistry-related tasks. We hope that our work serves as a strong starting point toward more intelligent AI and contributes to the advancement of science, technology, and society.
