MolFM: A Multimodal Molecular Foundation Model

Yizhen Luo; Kai Yang; Massimo Hong; Xing Yi Liu; Zaiqing Nie

TL;DR

MolFM addresses the challenge of integrating molecular structures, biomedical texts, and knowledge graphs by employing a tri-modal architecture with cross-attention in a dedicated multimodal encoder. It introduces four pre-training objectives (structure-text contrastive learning, cross-modal matching, masked language modeling, and knowledge graph embedding) and provides theoretical justifications linking these losses to deep metric learning, enabling grounding across modalities. Empirically, MolFM achieves state-of-the-art performance on cross-modal retrieval, molecule captioning, text-to-molecule generation, and molecular property prediction, including absolute gains of 12.13% (zero-shot) and 5.04% (fine-tuning) on cross-modal retrieval, along with qualitative grounding demonstrations. The work demonstrates the value of incorporating global knowledge graph information alongside local structure-text cues, offering a scalable path toward more comprehensive biomedical molecular understanding, while acknowledging data biases and safety considerations.

Abstract

Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. MolFM achieves state-of-the-art performance on various downstream tasks. On cross-modal retrieval, MolFM outperforms existing models with 12.13% and 5.04% absolute gains under the zero-shot and fine-tuning settings, respectively. Furthermore, qualitative analysis showcases MolFM's implicit ability to provide grounding from molecular substructures and knowledge graphs. Code and models are available on https://github.com/BioFM/OpenBioMed.
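The idea of minimizing the feature-space distance between different modalities of the same molecule can be illustrated with a symmetric, InfoNCE-style structure-text contrastive loss. The following is a minimal sketch in pure Python under stated assumptions, not the authors' implementation: the function names, the temperature of 0.1, and the toy two-dimensional features are all hypothetical, and the exact loss used in MolFM may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i should score highest at column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def structure_text_contrastive_loss(struct_feats, text_feats, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired features.

    struct_feats[i] and text_feats[i] are assumed to describe the same molecule;
    every other pairing in the batch is treated as a negative.
    """
    s = [l2_normalize(v) for v in struct_feats]
    t = [l2_normalize(v) for v in text_feats]
    # Temperature-scaled cosine similarity, structure rows vs. text columns.
    s2t = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    # Transpose to obtain text rows vs. structure columns.
    t2s = [list(col) for col in zip(*s2t)]
    return 0.5 * (diagonal_cross_entropy(s2t) + diagonal_cross_entropy(t2s))

if __name__ == "__main__":
    structures = [[1.0, 0.0], [0.0, 1.0]]      # toy structure features
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]   # texts aligned with structures
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # texts paired with the wrong molecule
    print("matched:", structure_text_contrastive_loss(structures, matched_texts))
    print("swapped:", structure_text_contrastive_loss(structures, swapped_texts))
```

In this toy batch the loss is small when each structure sits closest to its own text and large when the pairing is swapped, which is the behavior a contrastive alignment objective is meant to reward.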

Paper Structure (32 sections, 4 theorems, 32 equations, 15 figures, 15 tables)

Introduction
Related works
MolFM Pre-training
Model architecture
Pre-training objectives
Theoretical justifications
Pre-training dataset and knowledge graph
Implementation details
Downstream tasks
Experiments
Ablation studies
Evaluation on cross-modal retrieval
Evaluation on molecule captioning
Evaluation on text-to-molecule generation
Evaluation on molecular property prediction
...and 17 more sections

Key Result

Lemma 1

Let $r_{s}$ be a symmetric relation indicating structural similarity. Assuming that structurally similar molecules $h$ and $t$ satisfy $(h,r_{s},t)\in KG$ and $(t,r_{s},h)\in KG$, the following holds:

Figures (15)

Figure 1: Pre-training pipeline of MolFM. We formulate the knowledge graph input for each molecule (dashed circle) as the corresponding entity (orange node) and its 1-hop neighbors. MolFM employs three independent single-modal encoders to convert multimodal inputs into feature vectors. Additionally, it comprises a multimodal encoder to integrate fine-grained connections between atoms, neighboring entities and textual tokens. We leverage structure-text contrastive learning to align the feature space between two modalities, cross-modal matching loss and masked language modeling loss to promote a holistic understanding of multimodal information, and a knowledge embedding loss as a regularization term.
Figure 2: Model architecture for downstream tasks. For cross-modal retrieval, we re-rank top-k retrieved results with an ensemble of cosine similarity and CMM logit. For molecule captioning, we concatenate MolFM's structure encoder outputs with MolT5 encoder outputs, and use the MolT5 decoder to generate texts. For text-to-molecule generation, we append a MolT5 decoder to generate SMILES strings. For molecular property prediction, we concatenate the output of structure encoder and multimodal encoder to fit the molecular property.
Figure 3: Molecule captioning examples. We highlight the text segments where MolFM generates more accurate expressions.
Figure 4: Examples of text-to-molecule generation, along with the Morgan fingerprint Tanimoto similarity between the generated molecules and the ground truth.
Figure 5: Visualization of atom attention with different input texts.
...and 10 more figures
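The Tanimoto similarity mentioned in the Figure 4 caption is the ratio of shared on-bits to total on-bits between two fingerprints. Below is a minimal sketch that assumes each fingerprint is already available as a set of on-bit indices; in practice Morgan fingerprints would be computed with a cheminformatics toolkit, which this sketch does not do. The function name, the toy bit indices, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two fingerprints.

    Each argument is an iterable of on-bit indices. Returns |A & B| / |A | B|.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Both fingerprints are empty; 0.0 is an arbitrary convention chosen here.
        return 0.0
    return len(a & b) / len(union)

if __name__ == "__main__":
    # Hand-written toy on-bit sets standing in for real Morgan fingerprints.
    generated = {3, 17, 42, 128}
    ground_truth = {3, 17, 42, 256}
    print(tanimoto_similarity(generated, ground_truth))
```

A value of 1.0 means the two fingerprints have identical on-bits, and 0.0 means they share none.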

Theorems & Definitions (10)

Lemma 1
Lemma 2
Definition A.1
Definition A.2
Definition A.3
Lemma A.1
proof
Definition A.4
Lemma A.2
proof
