GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

Pengfei Liu, Yiming Ren, Jun Tao, Zhixiang Ren

TL;DR

GIT-Mol tackles the challenge of fusing graph, image, and text data in molecular science by introducing GIT-Former, a cross-attentional modality mixer that maps heterogeneous data into a unified latent space. Trained on a large multi-modal corpus, the model supports molecule captioning, text-based de novo molecule generation, image recognition, and molecular property prediction, with an any-to-language translation strategy that enables downstream tasks. The approach yields consistent gains over baselines, including 5%-10% improvements in property prediction and a 20.2% increase in generation validity, while ablations show the complementary value of each modality. The work advances AI-aided drug discovery by enabling richer molecular representations and flexible downstream tasks, and provides data/code to facilitate further research.

Abstract

Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information contained in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates graph, image, and text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.
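As a rough illustration of how a cross-attentional mixer can map inputs of different sizes into one latent space, the sketch below lets a fixed set of query vectors attend over modality features. It is a minimal, single-head sketch without learned projections, not the GIT-Former implementation; the names `cross_attend`, `DIM`, and `NUM_QUERIES`, and the feature counts (11 atoms, 49 image patches) are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Scaled dot-product cross-attention: each query becomes a weighted
    average of the feature vectors (no learned projections, single head)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes, not taken from the paper


def random_vectors(count):
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


queries = random_vectors(NUM_QUERIES)  # stand-in for learnable queries
graph_nodes = random_vectors(11)       # e.g. one embedding per atom (assumed)
image_patches = random_vectors(49)     # e.g. one embedding per patch (assumed)

# Both modalities end up with the same (NUM_QUERIES, DIM) shape.
graph_latent = cross_attend(queries, graph_nodes)
image_latent = cross_attend(queries, image_patches)
```

Because the output size depends only on the number of queries, inputs with different numbers of atoms or patches yield latents of identical shape, which is what makes a shared downstream decoder possible.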

Paper Structure

This paper contains 21 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An overview of GIT-Mol. (a) Internal information, including sequence and graph structure representations, emphasizes inherent chemical properties and simple topology. (b) External information, e.g., images and text descriptions, provides richer details and aids human understanding. (c) GIT-Former multi-modal encoder: architecture and pre-training strategy. GIT-Former aligns graph, image, and text with the target text modality (SMILES strings or captions) using self-attention and cross-attention. The learnable queries interact with each other and with the various modalities through these attention layers. Xmodal-Text Matching (XTM) and Xmodal-Text Contrastive Learning (XTC) are the self-supervised learning strategies, each tailored to a specific modality (X) and the target text modality. (d) Multi-modal molecular tasks: in cross-modal tasks, GIT-Former generates different embeddings depending on the input, which MolT5 then decodes into the target text modality, while an MLP model handles property prediction tasks.
  • Figure 2: Case study of molecule captioning. The GIT-Mol model exhibits precise chemical characterization, aligning closely with the ground truth.
  • Figure 3: Case study of molecule generation. GIT-Mol stands out for its ability to generate valid molecules. Moreover, even when the generated molecules are not identical to the ground truth, they still faithfully adhere to the features described in the textual captions.
  • Figure 4: Embedding visualization. (a) Original vector representations from various molecular data modalities. (b) Vector distribution after processing by an untrained GIT-Former, illustrating a tendency towards uniformity. (c) Hierarchical vector distribution after processing by a pre-trained GIT-Former, showing a layered separation of modalities: the outermost layer represents graph embeddings, followed by image embeddings, and the innermost layer contains SMILES strings and captions from the text modality. (d) Distribution of atoms in molecules, with color gradients indicating increasing atom numbers. (e) Results of K-means clustering applied to molecular data. (f) Distribution of C, N, and O atoms across different clusters. The pre-training effects demonstrate GIT-Former's ability to differentiate among modalities, subtypes within a modality, and specific properties within a given data type.
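
The Xmodal-Text Contrastive Learning (XTC) objective named in Figure 1 is not spelled out in this summary. As a hedged illustration of the general idea, the sketch below uses a generic symmetric InfoNCE-style loss as a stand-in: matched modality/text pairs in a batch are pulled together and mismatched pairs pushed apart. The function name `contrastive_loss` and the temperature value 0.07 are assumptions, not details from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax_at(scores, index):
    """Log-probability of `index` under a softmax over `scores` (stable)."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[index] - log_sum


def contrastive_loss(x_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each input is a matched pair.
    The temperature is an assumed value, not one reported in the paper."""
    xs = [l2_normalize(v) for v in x_embeds]
    ts = [l2_normalize(v) for v in text_embeds]
    n = len(xs)
    sim = [[sum(a * b for a, b in zip(x, t)) / temperature for t in ts] for x in xs]
    # Modality -> text: each row should pick its own column.
    x_to_text = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text -> modality: each column should pick its own row.
    text_to_x = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (x_to_text + text_to_x)


# Toy batch of two pairs: correct pairing versus swapped pairing.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.6f} swapped={swapped:.6f}")
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the text embeddings gives a much larger loss, which is the behavior a contrastive alignment objective is meant to reward.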