
MolBind: Multimodal Alignment of Language, Molecules, and Proteins

Teng Xiao, Chao Cui, Huaisheng Zhu, Vasant G. Honavar

TL;DR

MolBind learns a shared embedding space for language, molecular structure, and protein context through cross-modal contrastive learning. It encodes natural language, 2D molecular graphs, 3D molecular conformations, and 3D protein pockets with modality-specific encoders and trains them with four cross-modal contrastive losses, which enables zero-shot retrieval and classification. The authors introduce MolBind-M4, a four-modality dataset built from open sources, and report strong zero-shot performance on molecule-language retrieval, IUPAC name classification, and molecule-to-protein retrieval, with ablations supporting the contribution of each modality. The approach offers a scalable path to integrating diverse biomedical modalities for drug discovery and molecular reasoning, and may ease data constraints and enable transfer across related tasks.

Abstract

Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) remains challenging due to inherent gaps among them. In this work, we propose MolBind, a framework that trains encoders for multiple modalities through contrastive learning, mapping all modalities to a shared feature space for multi-modal semantic alignment. To facilitate effective pre-training of MolBind on multiple modalities, we also build and collect a high-quality dataset with four modalities, MolBind-M4, including graph-language, conformation-language, graph-conformation, and conformation-protein paired data. MolBind shows superior zero-shot learning performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying semantics of multiple modalities.
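The contrastive alignment described above can be illustrated with a symmetric InfoNCE loss over one batch of paired embeddings. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the temperature value, and the use of cosine similarity are all illustrative choices, and the exact form of the MolBind objective is not given in this section.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(z_a, z_b, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE loss for paired embeddings.

    z_a, z_b: arrays of shape (batch, dim); row i of z_a is paired with row i
    of z_b (for example a molecular graph and its text description).
    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(z_a))                   # matching pairs sit on the diagonal
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_ab + loss_ba)
```

Under this sketch, a training objective covering the four pair types listed in the abstract (graph-language, conformation-language, graph-conformation, conformation-protein) could be formed by summing one such term per pair type; how the terms are weighted is an assumption left open here.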

Paper Structure

This paper contains 17 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Illustration of the modalities aligned in this paper. Scientific language: text describing the name, properties, and structure of molecules. 2D molecular graph: atoms and bonds form the nodes and edges of the graph. 3D molecular conformation: the positions of the atoms in 3D Euclidean space. 3D protein pocket: a specific 3D region on the surface of a protein that can bind or interact with other molecules.
  • Figure 2: Zero-shot molecule classification with language.
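Zero-shot classification with language, as named in Figure 2, can be sketched as a nearest-neighbor lookup in the shared embedding space: each candidate class is represented by the embedding of its text description, and a molecule is assigned to the class whose text embedding is most similar. The sketch below assumes the embeddings have already been produced by trained encoders; `zero_shot_classify` is a hypothetical helper, and cosine similarity is an assumed scoring function.

```python
import numpy as np


def zero_shot_classify(mol_emb, class_text_emb):
    """Hypothetical helper: assign each molecule to its most similar class text.

    mol_emb:        (n_molecules, dim) molecule embeddings (placeholder inputs).
    class_text_emb: (n_classes, dim) embeddings of the class descriptions.
    Returns the predicted class index per molecule and the similarity matrix.
    """
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sims = mol @ txt.T                      # cosine similarity, (n_molecules, n_classes)
    return sims.argmax(axis=1), sims


# Toy usage with hand-made vectors standing in for encoder outputs.
class_text_emb = np.eye(3)
mol_emb = np.array([[0.9, 0.1, 0.0],
                    [0.0, 0.2, 1.0]])
preds, sims = zero_shot_classify(mol_emb, class_text_emb)
```

No classifier head is trained in this setup; adding a new class only requires embedding its text description, which is what makes the procedure zero-shot.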