ChemMLLM: Chemical Multimodal Large Language Model

Qian Tan, Dongzhan Zhou, Peng Xia, Wanhao Liu, Wanli Ouyang, Lei Bai, Yuqiang Li, Tianfan Fu

TL;DR

ChemMLLM introduces a unified chemical multimodal large language model that jointly understands and generates molecules across text, SMILES, and molecule images. It pairs a Mol-VQGAN image tokenizer with an LLM in an Image Tokenizer-LLM-Image De-tokenizer framework and employs a two-stage training pipeline to align multimodal representations. The authors design five cross-modal chemistry tasks and curate datasets to evaluate the model, demonstrating superior performance over general MLLMs and chemical LLM baselines, with particularly strong results in image captioning, image-to-property prediction, and image-driven molecule design. This work furnishes a versatile platform for interactive chemical reasoning and has potential impact on drug discovery and materials design, while outlining future directions to incorporate additional modalities and real-world validation.

Abstract

Multimodal large language models (MLLMs) have made impressive progress in many applications in recent years. However, chemical MLLMs that can handle cross-modal understanding and generation remain underexplored. To fill this gap, we propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation. We also design five multimodal tasks spanning text, molecular SMILES strings, and images, and curate the corresponding datasets. We benchmark ChemMLLM against a range of leading general MLLMs and chemical LLMs on these tasks. Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks. For example, in the molecule image optimization task, ChemMLLM outperforms the best baseline (GPT-4o) by 116.75% (4.27 vs. 1.97 property improvement). The code is publicly available at https://github.com/bbsbz/ChemMLLM.git.
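
The 116.75% figure quoted in the abstract is consistent with a plain relative gain over the baseline, (4.27 - 1.97) / 1.97. The short check below reproduces that arithmetic. The function name `relative_improvement` is illustrative and not taken from the paper or its repository.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain of `ours` over `baseline`, expressed in percent."""
    return (ours - baseline) / baseline * 100.0


# Property-improvement values quoted in the abstract.
chemmllm_gain = 4.27
gpt4o_gain = 1.97

improvement = relative_improvement(chemmllm_gain, gpt4o_gain)
print(f"{improvement:.2f}%")
```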

Paper Structure

This paper contains 38 sections, 4 equations, 18 figures, 19 tables.

Figures (18)

  • Figure 1: Motivation. Comparison between task-specific sequence-to-sequence (Seq2Seq) models (Cho et al., 2014) and a unified chemical large language model.
  • Figure 2: Overall architecture of ChemMLLM. (a) Image and SMILES tokenizers and de-tokenizers. The image tokenizer uses a CNN to extract a spatial feature map, and each $n_z$-dimensional spatial code is quantized to a discrete latent code via vector quantization (VQ); the resulting codebook indices serve as the image tokens. The image de-tokenizer uses a CNN to reconstruct the image from the discrete feature map, and a patch-based discriminator then predicts whether each patch is fake (f) or real (r). SMILES strings are tokenized in the same way as text, i.e., mapped to a token sequence by the text tokenizer. (b) ChemMLLM training. (c) ChemMLLM inference. (d) Two-stage training paradigm.
  • Figure 3: Performance of ChemMLLM-7B, ChemMLLM-34B, and the best baseline on the five tasks. Mean Pearson is the mean Pearson correlation over seven properties; Avg Sim is the average Tanimoto similarity; Normalized $\Delta$LogP is the increase in LogP divided by the maximum value.
  • Figure 4: An example of the img2caption task, comparing Qwen-VL-Chat with ChemMLLM.
  • Figure 5: Comparison of answers from Qwen-VL-Chat and ChemMLLM on the img2property task. Accurate answers are highlighted in bottle green, close answers in light green, and inaccurate answers in red.
  • ...and 13 more figures
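
The vector-quantization step described in the Figure 2 caption can be illustrated with a toy example. This is a minimal sketch, not the authors' Mol-VQGAN: it assumes nearest-neighbour lookup under squared Euclidean distance (the usual choice in VQ-based tokenizers), and the codebook, the feature map, and the helper names `quantize` and `lookup` are all hypothetical. In the real model the feature map comes from a learned CNN encoder and the codebook is learned during training.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature_map: List[List[Vector]], codebook: List[Vector]) -> List[List[int]]:
    """Replace each n_z-dimensional spatial code by the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k])) for z in row]
        for row in feature_map
    ]


def lookup(tokens: List[List[int]], codebook: List[Vector]) -> List[List[Vector]]:
    """Map token indices back to codebook vectors (the input a de-tokenizer would decode)."""
    return [[codebook[k] for k in row] for row in tokens]


# Hypothetical toy values: a 2x2 feature map with n_z = 2 and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
feature_map = [
    [(0.1, -0.2), (0.9, 1.2)],
    [(-0.8, 0.7), (0.4, 0.3)],
]

image_tokens = quantize(feature_map, codebook)
quantized_map = lookup(image_tokens, codebook)
print(image_tokens)  # [[0, 1], [2, 0]]
```

The grid of indices in `image_tokens` plays the role of the image tokens passed to the LLM, while `quantized_map` corresponds to the discrete feature map from which the de-tokenizer reconstructs an image.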