ChemDFM-X: Towards Large Multimodal Model for Chemistry
Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Ziping Wan, Yansi Li, Zhongyang Dai, Xin Chen, Kai Yu
TL;DR
ChemDFM-X tackles the challenge of building a practical cross-modal foundation model for chemistry by integrating five non-text modalities (molecular graph, conformation, image, MS2 spectrum, IR spectrum) with a shared decoder. It adopts a decoder-plus-encoders architecture: a pre-trained chemical LLM (ChemDFM) is kept frozen while modality-specific encoders and projection layers are trained on an instruction-tuning dataset of 7.6M samples derived from 1.3M seed SMILES. Across structural, image, and characterization tasks, ChemDFM-X outperforms generalist large multimodal models and shows clear cross-modality advantages, particularly in reaction-oriented tasks and spectrum-assisted reasoning. By enabling a single model to understand and reason over diverse data types and tasks, this work is a step toward Chemical General Intelligence (CGI) and highlights the practical potential of such systems.
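The decoder-plus-encoders layout described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the `Block` class, the toy dimensions, the use of a single linear map as an "encoder", and the choice to prepend modality embeddings to the text embeddings are all assumptions made for illustration. The only points taken from the text are the five modalities, the frozen decoder, and the trainable encoders and projection layers.

```python
import random
from dataclasses import dataclass

# Modalities named in the TL;DR.
MODALITIES = ["graph", "conformation", "image", "ms2", "ir"]


@dataclass
class Block:
    """Hypothetical stand-in for a network module: a weight matrix plus a trainable flag."""
    name: str
    weights: list  # list of rows; each row is a list of floats
    trainable: bool


def make_weights(rows, cols, rng):
    """Create a small random (rows x cols) matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weights):
    """Apply a linear map to each vector: output length equals the number of rows."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weights] for vec in vectors]


def build_model(enc_dim=4, dec_dim=6, seed=0):
    """One frozen decoder plus a trainable encoder and projector per modality."""
    rng = random.Random(seed)
    blocks = {"decoder": Block("decoder", make_weights(dec_dim, dec_dim, rng), trainable=False)}
    for m in MODALITIES:
        blocks[f"{m}_encoder"] = Block(f"{m}_encoder", make_weights(enc_dim, enc_dim, rng), True)
        blocks[f"{m}_projector"] = Block(f"{m}_projector", make_weights(dec_dim, enc_dim, rng), True)
    return blocks


def decoder_inputs(blocks, modality, raw_features, text_embeddings):
    """Encode modality features, project them to the decoder width, and join them with text."""
    encoded = project(raw_features, blocks[f"{modality}_encoder"].weights)
    projected = project(encoded, blocks[f"{modality}_projector"].weights)
    # Assumption: modality embeddings are placed before the text embeddings.
    return projected + text_embeddings


blocks = build_model()
trainable = sorted(name for name, block in blocks.items() if block.trainable)
sequence = decoder_inputs(blocks, "ir", [[0.1] * 4, [0.2] * 4], [[0.0] * 6 for _ in range(3)])
```

In this sketch only the ten encoder and projector blocks are marked trainable, and every vector reaching the decoder has the decoder's embedding width regardless of which modality it came from.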
Abstract
Rapid developments in AI tools are expected to offer unprecedented assistance to research in the natural sciences, including chemistry. However, neither existing unimodal, task-specific specialist models nor emerging general large multimodal models (LMMs) can cover the wide range of chemical data modalities and task categories. To address the real demands of chemists, there is a strong need for a cross-modal Chemical General Intelligence (CGI) system that harnesses the potential of LMMs to serve as a truly practical and useful research assistant. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora while significantly reducing cost, resulting in an instruction-tuning dataset containing 7.6M samples. After instruction fine-tuning, ChemDFM-X is evaluated in extensive experiments on different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.
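The data-generation strategy, in which other modalities are derived from one initial modality, can be sketched as follows. This is an illustrative outline under stated assumptions: `fake_converter`, `expand_seed`, the instruction template, and the sample layout are hypothetical, and the converters return placeholder strings where a real pipeline would run approximate calculations or task-specific predictors. The seed is assumed to be a SMILES string, as stated in the TL;DR.

```python
from typing import Callable, Dict, List


def fake_converter(modality: str) -> Callable[[str], str]:
    """Hypothetical placeholder for an approximate calculation or a predictor model."""
    return lambda smiles: f"<{modality} derived from {smiles}>"


# One converter per non-text modality named in the TL;DR.
CONVERTERS: Dict[str, Callable[[str], str]] = {
    m: fake_converter(m) for m in ["graph", "conformation", "image", "ms2", "ir"]
}


def expand_seed(smiles: str, description: str) -> List[dict]:
    """Turn one seed molecule into one instruction sample per modality."""
    samples = []
    for modality, convert in CONVERTERS.items():
        samples.append({
            "modality": modality,
            "input": convert(smiles),  # derived representation of the seed
            "instruction": f"Describe the molecule given its {modality} data.",
            "output": description,     # target text shared across modalities
        })
    return samples


# Two toy seeds: (SMILES, text label).
seeds = [("CCO", "ethanol"), ("c1ccccc1", "benzene")]
dataset = [sample for smi, desc in seeds for sample in expand_seed(smi, desc)]
```

Here two seeds and five modalities yield ten samples. For comparison, the figures quoted in the TL;DR (7.6M samples from 1.3M seed SMILES) correspond to roughly six samples per seed on average.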
