3D-MolT5: Leveraging Discrete Structural Information for Molecule-Text Modeling

Qizhi Pei, Rui Yan, Kaiyuan Gao, Jinhua Zhu, Lijun Wu

TL;DR

3D-MolT5 addresses the gap in molecule-text modeling by introducing a discrete 3D token vocabulary derived from E3FP, enabling joint modeling of molecular sequences, 3D structures, and natural language within a single T5-style architecture. The framework uses a five-task pre-training scheme (denoising and translation) to promote strong cross-modal interaction, followed by instruction tuning across diverse molecule-related tasks. Empirically, it achieves state-of-the-art or near-state-of-the-art performance on 3D-dependent properties, 3D captioning, and text-based molecule generation, outperforming baselines that rely on separate 3D encoders. The approach offers a scalable, SE(3)-invariant discrete representation that enhances generalization and cross-modal understanding, with practical implications for accelerated molecular discovery and description generation.

Abstract

The integration of molecular and natural language representations has emerged as a focal point in molecular science, with recent advancements in Language Models (LMs) demonstrating significant potential for comprehensive modeling of both domains. However, existing approaches face notable limitations, particularly in their neglect of three-dimensional (3D) information, which is crucial for understanding molecular structures and functions. While some efforts have been made to incorporate 3D molecular information into LMs using external structure encoding modules, significant difficulties remain, such as insufficient interaction across modalities in pre-training and challenges in modality alignment. To address these limitations, we propose 3D-MolT5, a unified framework designed to model molecules in both sequence and 3D structure spaces. The key innovation of our approach lies in mapping fine-grained 3D substructure representations into a specialized 3D token vocabulary. This methodology facilitates the seamless integration of sequence and structure representations in a tokenized format, enabling 3D-MolT5 to encode molecular sequences, molecular structures, and text sequences within a unified architecture. Leveraging this tokenized input strategy, we build a foundation model that unifies the sequence and structure data formats. We then conduct joint pre-training with multi-task objectives to enhance the model's comprehension of these diverse modalities within a shared representation space. Thus, our approach significantly improves cross-modal interaction and alignment, addressing key challenges in previous work. Further instruction tuning demonstrates that 3D-MolT5 has strong generalization ability and surpasses existing methods with superior performance in multiple downstream tasks. Our code is available at https://github.com/QizhiPei/3D-MolT5.
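
The idea of mapping substructure representations into a discrete 3D token vocabulary can be illustrated with a minimal sketch. It assumes each atom-centered substructure is summarized by an integer identifier, as hash-based fingerprints typically produce, and that identifiers are folded into a fixed-size vocabulary with a modulo. The vocabulary size, the `<3D_i>` token format, and the function name `to_3d_tokens` are hypothetical and are not taken from the paper or its repository.

```python
from typing import List, Sequence

# Hypothetical vocabulary size chosen for illustration only.
VOCAB_SIZE = 4096


def to_3d_tokens(identifiers: Sequence[int], vocab_size: int = VOCAB_SIZE) -> List[str]:
    """Fold integer substructure identifiers into a fixed discrete vocabulary.

    Each identifier is reduced modulo `vocab_size`, so any integer maps to one
    of `vocab_size` token strings of the (hypothetical) form "<3D_i>".
    """
    return [f"<3D_{identifier % vocab_size}>" for identifier in identifiers]


# Made-up identifiers standing in for per-atom substructure hashes.
atom_identifiers = [1234567890, 42, 4096 + 7]
tokens_3d = to_3d_tokens(atom_identifiers)
print(tokens_3d)
```

Because the tokens are plain vocabulary entries, they can be fed to the same embedding layer machinery as sequence and text tokens. The sketch does not show how the paper derives the identifiers or combines the modalities.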

Paper Structure (38 sections, 2 equations, 6 figures, 15 tables, 1 algorithm)


Figures (6)

  • Figure 1: Overview of the 3D-MolT5 multi-task pre-training. The upper 4 tasks involve the "recover masked spans" task, where consecutive spans of the input are replaced with sentinel tokens such as <X>, <Y>, <Z>. The bottom 3 tasks are translation tasks. The input modalities are annotated with small icons. Tokens with 3D structure information are colored in blue, and [3D] refers to 3D tokens.
  • Figure 2: The process of 3D molecular tokenization and alignment between 1D SELFIES tokens and 3D tokens. We choose one conformer of 2-(Formylamino)benzoic acid (CID: 101399) as the example. At each iteration of E3FP, each atom and its neighborhood substructure are represented by a 3D token. The alignment between 1D SELFIES tokens and 3D tokens is shown in the table at the bottom.
  • Figure 3: Ablation studies on the PubChemQC dataset. The evaluation metric is MAE.
  • Figure 4: Visualization of the E3FP process for the second atom $a_1$ (Carbon) of the molecule with CID 101399 (same as the case in Figure 2).
  • Figure 5: Visualization of 5 conformers and their corresponding 3D tokens for the molecule with CID 101399 (same as Figure 2). The differences among their 3D tokens are colored in red.
  • ...and 1 more figures
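
The "recover masked spans" objective described in the Figure 1 caption can be sketched as follows. This is a minimal illustration of T5-style span corruption, not the authors' implementation: the spans are passed in explicitly rather than sampled, the example tokens are made up, and the function name `corrupt_spans` is hypothetical. Only the sentinel names `<X>`, `<Y>`, `<Z>` come from the caption.

```python
from typing import List, Sequence, Tuple

# Sentinel names as they appear in the Figure 1 caption.
SENTINELS = ["<X>", "<Y>", "<Z>"]


def corrupt_spans(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each half-open [start, end) span with one sentinel token.

    Returns (corrupted_input, target). The target lists each sentinel followed
    by the tokens it replaced, which is what the model learns to generate.
    """
    if len(spans) > len(SENTINELS):
        raise ValueError("more spans than available sentinels")
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        corrupted.extend(tokens[cursor:start])  # keep tokens before the span
        corrupted.append(sentinel)              # mask the span in the input
        target.append(sentinel)                 # target: sentinel, then the span
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep tokens after the last span
    return corrupted, target


# Made-up token sequence; mask positions 1-2 and position 4.
tokens = ["[C]", "[C]", "[=O]", "[O]", "[N]", "[C]"]
model_input, model_target = corrupt_spans(tokens, [(1, 3), (4, 5)])
print(model_input)
print(model_target)
```

The original T5 recipe also appends a closing sentinel to the target; it is omitted here for brevity. The same routine applies unchanged to any token sequence, which is why a single denoising objective can cover text, molecular sequences, and sequences carrying 3D tokens.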