3D-MolT5: Leveraging Discrete Structural Information for Molecule-Text Modeling
Qizhi Pei, Rui Yan, Kaiyuan Gao, Jinhua Zhu, Lijun Wu
TL;DR
3D-MolT5 addresses the neglect of 3D structure in molecule-text modeling by introducing a discrete 3D token vocabulary derived from E3FP, enabling joint modeling of molecular sequences, 3D structures, and natural language within a single T5-style architecture. The framework uses a five-task pre-training scheme that combines denoising and translation objectives to promote cross-modal interaction, followed by instruction tuning across diverse molecule-related tasks. Empirically, it achieves state-of-the-art or near-state-of-the-art performance on 3D-dependent property prediction, 3D captioning, and text-based molecule generation, outperforming baselines that rely on separate 3D encoders. The approach offers a scalable, SE(3)-invariant discrete representation that improves generalization and cross-modal understanding, with practical implications for accelerated molecular discovery and description generation.
Abstract
The integration of molecular and natural language representations has emerged as a focal point in molecular science, with recent advancements in Language Models (LMs) demonstrating significant potential for comprehensive modeling of both domains. However, existing approaches face notable limitations, particularly in their neglect of three-dimensional (3D) information, which is crucial for understanding molecular structures and functions. While some efforts have been made to incorporate 3D molecular information into LMs using external structure encoding modules, significant difficulties remain, such as insufficient interaction across modalities in pre-training and challenges in modality alignment. To address these limitations, we propose 3D-MolT5, a unified framework designed to model molecules in both sequence and 3D structure spaces. The key innovation of our approach lies in mapping fine-grained 3D substructure representations into a specialized 3D token vocabulary. This methodology facilitates the seamless integration of sequence and structure representations in a tokenized format, enabling 3D-MolT5 to encode molecular sequences, molecular structures, and text sequences within a unified architecture. Leveraging this tokenized input strategy, we build a foundation model that unifies the sequence and structure data formats. We then conduct joint pre-training with multi-task objectives to enhance the model's comprehension of these diverse modalities within a shared representation space. Thus, our approach significantly improves cross-modal interaction and alignment, addressing key challenges in previous work. Further instruction tuning demonstrates that 3D-MolT5 has strong generalization ability and surpasses existing methods on multiple downstream tasks. Our code is available at https://github.com/QizhiPei/3D-MolT5.
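The idea of mapping 3D substructure representations into a 3D token vocabulary can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each molecule's 3D substructures have already been reduced to integer identifiers (as a fingerprinting method such as E3FP might provide), and the function names, the `<3d_k>` token format, and the `<3d_unk>` fallback are hypothetical choices made for illustration only.

```python
from typing import Dict, List


def build_3d_vocab(identifier_lists: List[List[int]]) -> Dict[int, str]:
    """Assign one discrete token to every distinct substructure identifier.

    `identifier_lists` holds, per molecule, the integer identifiers of its
    3D substructures. These integers are assumed inputs; computing them
    from conformers is outside the scope of this sketch.
    """
    vocab: Dict[int, str] = {}
    for identifiers in identifier_lists:
        for ident in identifiers:
            if ident not in vocab:
                # Hypothetical token format: <3d_0>, <3d_1>, ...
                vocab[ident] = f"<3d_{len(vocab)}>"
    return vocab


def tokenize_3d(identifiers: List[int], vocab: Dict[int, str],
                unk_token: str = "<3d_unk>") -> List[str]:
    """Look up each identifier; unseen identifiers map to a fallback token."""
    return [vocab.get(ident, unk_token) for ident in identifiers]


# Toy corpus of two molecules with made-up substructure identifiers.
corpus = [[101, 202, 303], [202, 404]]
vocab = build_3d_vocab(corpus)
tokens = tokenize_3d([202, 999], vocab)
print(vocab)
print(tokens)
```

Because the mapping is a plain lookup, the resulting tokens inherit whatever invariance the underlying identifiers have; the lookup itself never sees atomic coordinates. Once structure is expressed as tokens, it can be handled by the same embedding-and-transformer machinery as sequence and text tokens, which is the property the abstract relies on.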
