
ChemDFM-X: Towards Large Multimodal Model for Chemistry

Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Ziping Wan, Yansi Li, Zhongyang Dai, Xin Chen, Kai Yu

TL;DR

ChemDFM-X addresses the challenge of building a practical cross-modal foundation model for chemistry by integrating five non-text modalities (graph, conformation, image, MS2, IR) with a shared decoder. It adopts a decoder-plus-encoders architecture, freezing a pre-trained chemical LLM (ChemDFM) while training modality-specific encoders and projection layers, and is trained on an instruction-tuning dataset of 7.6M samples derived from 1.3M seed SMILES. Across structural, image, and characterization tasks, ChemDFM-X outperforms generalist large multimodal models and shows clear cross-modality advantages, particularly in reaction-oriented tasks and spectrum-assisted reasoning. By supporting multimodal understanding and reasoning across diverse data types and tasks in a single model, this work is a step toward Chemical General Intelligence (CGI).

Abstract

Rapid developments in AI tools are expected to offer unprecedented assistance to research in the natural sciences, including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMMs) can cover the wide range of chemical data modalities and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, one that serves as a truly practical and useful research assistant by exploiting the potential of LMMs, is greatly needed. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora while significantly reducing cost, resulting in an instruction-tuning dataset containing 7.6M samples. After instruction finetuning, ChemDFM-X is evaluated in extensive experiments on different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.
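
The decoder-plus-encoders layout mentioned in the TL;DR can be illustrated with a minimal, dependency-free sketch. It only shows the general idea of projecting modality-specific encoder outputs to the decoder's embedding width and combining them with text token embeddings. The dimensions, the random weights, the helper names, and the choice to prepend modality tokens are all assumptions made for illustration; none of them is taken from the paper.

```python
import random

HIDDEN = 8  # hypothetical decoder embedding width (assumption)

# Hypothetical encoder output widths per modality (assumption, not from the paper).
ENCODER_DIMS = {"graph": 4, "conformation": 6, "image": 5, "ms2": 3, "ir": 3}


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a trained projection layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weights):
    """Map encoder feature vectors (n x in_dim) to the decoder width (n x out_dim)."""
    out_dim = len(weights[0])
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in features
    ]


def build_decoder_input(text_embeddings, modality_features, projections):
    """Prepend projected modality tokens to the text token embeddings.

    Prepending is an illustrative choice; a real model may place modality
    tokens at marked positions inside the prompt instead.
    """
    sequence = []
    for name, feats in modality_features.items():
        sequence.extend(project(feats, projections[name]))
    sequence.extend(text_embeddings)
    return sequence


rng = random.Random(0)
projections = {m: make_projection(d, HIDDEN, rng) for m, d in ENCODER_DIMS.items()}

# Toy inputs: two graph tokens, one IR token, three text tokens.
modality_features = {"graph": [[1.0] * 4, [0.5] * 4], "ir": [[0.2] * 3]}
text_embeddings = [[0.0] * HIDDEN for _ in range(3)]

decoder_input = build_decoder_input(text_embeddings, modality_features, projections)
print(len(decoder_input), len(decoder_input[0]))
```

Each modality keeps its own encoder width, and only the projection has to agree with the decoder, which is what lets a single shared decoder consume heterogeneous inputs.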

Paper Structure

This paper contains 36 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The overview of ChemDFM-X. The different modalities involved in chemical tasks are distinguished by colors. Structural modalities are marked with blue, figures are marked with orange, and spectra are marked with green. The text modality is marked with purple in the input part and omitted in the output part. The dialogue-based free-form human-AI collaboration may involve any feasible modalities and is marked with gray. For a detailed introduction to these modalities, please refer to the overview section of the paper.
  • Figure 2: Overview of ChemDFM-X model structure and training paradigm. The colors to mark different input modalities are aligned with Figure 1 in the main text.
  • Figure 3: An example of the final structure of instruction tuning data.
  • Figure 4: Examples of three molecular styles.
  • Figure 5: Evaluation tasks for structural modalities.
  • ...and 2 more figures