Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

Qizhi Pei; Zhimeng Zhou; Kaiyuan Gao; Jinhua Zhu; Yue Wang; Zun Wang; Tao Qin; Lijun Wu; Rui Yan

TL;DR

This survey argues for fusing biomolecular modeling with natural language to obtain richer, context-aware representations. It covers biomolecule representations (1D/2D/3D and alternatives), text sources, and a taxonomy of cross-modal learning architectures (encoder/decoder-only, dual-stream, PaLM-E-style, GILL-style), along with representation and learning strategies (MLM, NTP, CMA, SCL) and practical applications. It highlights core goals for cross-modal integration, including Knowledgeable models (deep, integrated representations), Versatile models (generalizable, interactive intelligence), and agentic interactive systems, all underpinned by datasets, benchmarks, and pre-training objectives. The authors also discuss challenges in tokenization, data scarcity, generalization, and ethics, and propose concrete directions such as structure-aware tokenization, data augmentation, and multi-agent discovery pipelines. Overall, the work provides a structured reference for advancing biomolecule-language models that can accelerate discovery and understanding in biology and chemistry.

Abstract

The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry, and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained in textual data sources to enhance our fundamental understanding and to enable downstream computational tasks such as biomolecule property prediction. Fusing the nuanced narratives expressed in natural language with the structural and functional specifics of biomolecules, as described by various molecular modeling techniques, opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view that encompasses both the symbolic qualities conveyed through language and the quantitative structural characteristics. In this review, we provide an extensive analysis of recent advances in the cross-modal modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. Related resources and content are continuously updated at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.

Paper Structure (57 sections, 5 equations, 7 figures, 4 tables)

Introduction
Biomolecule Representation
1D Sequence
2D Graph
3D Structure
Alternative Bio-Representations
Biomolecule-Related Text Sources
Foundations of Cross-Modal Integration
Intuition for Cross-Modal Integration
Paradigm Shift: From Traditional Deep Learning to LLMs
Goals for Cross-Modal Integration
Knowledgeable: Building Deep and Integrated Representations
Versatile: Toward Generalizable and Interactive Intelligence
Learning Framework
Encoder/Decoder-only Model
...and 42 more sections

Figures (7)

Figure 1: Overview of cross-modal integration methods in BL, which are categorized based on the modality and biorepresentation. Here "BioMulti" refers to settings that encompass text, molecule, protein, and potentially additional biological entities such as materials and cells. A minority of works that do not neatly fit into representative categories are classified into "others".
Figure 2: A chronological overview of BL models proposed in recent years. Different colored rectangles correspond to different input modalities of the model. Cross-modal modeling involving multiple modalities has grown in popularity over time.
Figure 3: Representations of text, molecules, and proteins. Text: 1D token sequences. Molecules: 1D strings (SMILES, IUPAC, SELFIES), 2D graphs, 3D conformations. Proteins: 1D amino acid sequences, 2D* secondary-structure abstraction, and 3D structures. 2D* denotes secondary structure as a coarse structural abstraction rather than a literal 2D spatial representation.
Figure 4: Cross-modal BL modeling demonstration: integrating protein sequence, molecular SMILES, and text for downstream generation and reasoning.
Figure 5: Dual goals of BL modeling: Knowledgeable and Versatile.
...and 2 more figures
