MolBind: Multimodal Alignment of Language, Molecules, and Proteins
Teng Xiao, Chao Cui, Huaisheng Zhu, Vasant G. Honavar
TL;DR
MolBind addresses multi-modal representation learning across natural language, molecular structure, and protein contexts by learning a shared embedding space through cross-modal contrastive learning. It encodes language, 2D molecular graphs, 3D molecular conformations, and 3D protein pockets with modality-specific encoders, and trains them with four cross-modal losses that align semantically matching inputs, enabling zero-shot retrieval and classification. The authors introduce MolBind-M4, a four-modality dataset built from open sources, and report strong zero-shot performance on molecule-language retrieval, IUPAC name classification, and molecule-to-protein retrieval, with ablations supporting the contribution of each modality. The approach offers a scalable path to integrating diverse biomedical modalities for drug discovery and molecular reasoning, and may reduce reliance on task-specific labeled data and support transfer across related tasks.
Abstract
Recent advancements in biology and chemistry have leveraged multi-modal learning, integrating molecules and their natural language descriptions to enhance drug discovery. However, current pre-training frameworks are limited to two modalities, and designing a unified network to process different modalities (e.g., natural language, 2D molecular graphs, 3D molecular conformations, and 3D proteins) remains challenging due to inherent gaps among them. In this work, we propose MolBind, a framework that trains encoders for multiple modalities through contrastive learning, mapping all modalities to a shared feature space for multi-modal semantic alignment. To facilitate effective pre-training of MolBind on multiple modalities, we also build and collect a high-quality dataset with four modalities, MolBind-M4, including graph-language, conformation-language, graph-conformation, and conformation-protein paired data. MolBind shows superior zero-shot learning performance across a wide range of tasks, demonstrating its strong capability of capturing the underlying semantics of multiple modalities.
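To make the idea of contrastive alignment in a shared feature space concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE (CLIP-style) objective over one pair of modalities; the abstract does not specify the exact loss, so this form, the temperature of 0.07, and the function names (`symmetric_info_nce`, `retrieve`) are illustrative assumptions. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(a, b, temperature):
    """Cosine similarity between every row of a and every row of b, divided by temperature."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(a, b, temperature=0.07):
    """Assumed loss form: average of the a-to-b and b-to-a contrastive losses.

    Row i of a and row i of b are treated as a matched pair (for example,
    a molecule graph and its text description); all other rows in the
    batch act as negatives.
    """
    logits = similarity_matrix(a, b, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve(query, candidates):
    """Zero-shot retrieval: index of the candidate most similar to the query."""
    scores = similarity_matrix([query], candidates, 1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (hypothetical): two molecules and two descriptions.
molecule_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb_matched = [[1.0, 0.0], [0.0, 1.0]]
text_emb_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_info_nce(molecule_emb, text_emb_matched)
loss_shuffled = symmetric_info_nce(molecule_emb, text_emb_shuffled)
best = retrieve([0.9, 0.1], text_emb_shuffled)
```

In this toy setup the loss is lower when paired rows point in the same direction than when the pairs are shuffled, which is the signal that pulls matched inputs together during training. Under the same assumption, a multi-modality setup would sum one such term per paired dataset (graph-language, conformation-language, graph-conformation, and conformation-protein), and retrieval across modalities reduces to a nearest-neighbor lookup in the shared space.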
