ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Junxian Li, Di Zhang, Xunzhi Wang, Zeying Hao, Jingdi Lei, Qian Tan, Cai Zhou, Wei Liu, Yaotian Yang, Xinrui Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Wei Li, Shufei Zhang, Mao Su, Wanli Ouyang, Yuqiang Li, Dongzhan Zhou

TL;DR

ChemVLM is a domain-specific multimodal large language model for chemistry that pairs a Vision Transformer-based encoder with a chemistry-aware LLM in a ViT-MLP-LLM framework. It uses a two-stage training strategy and a curated data suite (ChemOCR, MMCR-Bench, MMChemBench) to support image-text reasoning across OCR, multimodal reasoning, and molecule understanding. The model achieves competitive and state-of-the-art performance on OCR, MMCR, and multimodal molecule understanding benchmarks, demonstrates cross-domain generalization, and is released as open source. This work advances practical multimodal reasoning in chemistry and provides a resource for future domain-specific multimodal AI research.
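
The ViT-MLP-LLM layout mentioned above can be illustrated with a toy data-flow sketch. This is not ChemVLM's implementation: the dimensions, the two-layer ReLU projector, the random stand-in features, and the image-first token order are all assumptions chosen for illustration. It only shows the general idea of projecting vision-encoder patch features into the language model's embedding space and joining them with text embeddings.

```python
import random

def make_layer(in_dim, out_dim, rng):
    """Create a toy linear layer: an (in_dim x out_dim) weight matrix and a zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, weights, bias):
    """Compute y = x @ weights + bias for a single input vector x."""
    return [
        sum(x_i * weights[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]

def project_patches(patch_feats, layer1, layer2):
    """MLP projector (assumed two layers with ReLU): vision width -> LLM width."""
    projected = []
    for feat in patch_feats:
        hidden = [max(0.0, h) for h in linear(feat, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected

def build_llm_input(image_tokens, text_tokens):
    """Join projected image tokens and text embeddings into one sequence.
    Placing image tokens first is an assumption of this sketch."""
    return image_tokens + text_tokens

rng = random.Random(0)
D_VIS, D_LLM = 8, 16              # toy widths, not the real model sizes
NUM_PATCHES, NUM_TEXT = 4, 3      # toy sequence lengths

# Stand-ins for ViT patch features and LLM text-token embeddings.
patch_feats = [[rng.gauss(0, 1) for _ in range(D_VIS)] for _ in range(NUM_PATCHES)]
text_tokens = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(NUM_TEXT)]

layer1 = make_layer(D_VIS, D_LLM, rng)
layer2 = make_layer(D_LLM, D_LLM, rng)

image_tokens = project_patches(patch_feats, layer1, layer2)
llm_input = build_llm_input(image_tokens, text_tokens)
print(len(llm_input), len(llm_input[0]))
```

The projector is the only new component between the two pretrained parts: after it, every image patch has the same width as a text embedding, so the language model can consume both in a single sequence.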

Abstract

Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require processing visual information, which existing chemical LLMs cannot handle. This creates a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an open-source multimodal large language model designed specifically for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.

Paper Structure (27 sections, 8 figures, 6 tables)

This paper contains 27 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overall architecture of ChemVLM. ChemVLM combines the advantages of an advanced vision transformer and a large language model enriched with chemical knowledge, giving it a strong ability to understand and reason over multimodal chemical knowledge.
  • Figure 2: Data distribution of our training data and benchmarks.
  • Figure 3: Overview of our data composition work. This multi-step process ensures our model's good performance and a comprehensive evaluation.
  • Figure 4: In the left figure, we compare ChemVLM with three other MLLMs on the non-chemistry subjects of CMMU. In the right figure, we show results on the chemistry-related subsets of SciBench. The numbers represent the performance of ChemVLM.
  • Figure 5: A qualitative comparison of answers on MMCR-Bench between GPT-4V and our ChemVLM. Mistakes within the answers are highlighted in red, whereas detailed and accurate parts are emphasized in green. Since this is a Chinese exam question, we provide both the original Chinese text and its English translation. This shows the strong MMCR capability of ChemVLM.
  • ...and 3 more figures