GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention
Sangyeup Kim, Nayeon Kim, Yinhua Piao, Sun Kim
TL;DR
GraphT5 addresses the challenge of integrating 1D SMILES text with 2D molecular graphs for molecular language modeling. It introduces a cross-token attention mechanism that enables token-level interactions between SMILES and graph modalities, and leverages a T5 backbone with a graph encoder pre-trained via GraphMVP. The approach achieves state-of-the-art results on molecule captioning and IUPAC name prediction across PubChem324k and ChEBI-20, with ablations confirming the benefits of multi-modal inputs and cross-modal interactions. This multi-modal fusion enhances the grounding of textual descriptions in molecular structure, offering improved generation quality with potential impact on drug discovery and materials science.
Abstract
Molecular language modeling tasks such as molecule captioning are recognized for their potential to deepen understanding of molecular properties, which can aid drug discovery and material synthesis based on chemical reactions. Whereas molecular property prediction commonly uses molecule graphs, most methods in molecular language modeling rely heavily on SMILES sequences, because the task involves generating a sequence of multiple tokens with transformer-based models. A main challenge is therefore how to integrate graph data, which contains structural and spatial information about molecules, with text data. Moreover, simply feeding both 1D SMILES text and a 2D graph as inputs, without addressing how the two modalities align and represent the same molecular structure, makes it difficult to fully exploit structural knowledge about molecules. To this end, we propose GraphT5, a multi-modal framework that integrates 1D SMILES text and 2D graph representations of molecules for molecular language modeling. Specifically, we introduce a novel cross-token attention module in GraphT5 to bridge the gap arising from the fundamental differences between the two molecular representations. Cross-token attention exploits implicit information that arises from fine-grained, token-level interactions between the SMILES and graph representations of a molecule, which benefits molecular language modeling. Extensive experiments on molecule captioning and IUPAC name prediction, together with case studies, show that GraphT5 outperforms the latest baseline approaches, validating its effectiveness in utilizing both 1D SMILES text and 2D graph representations.
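To make the idea of token-level interaction between the two modalities concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which SMILES token vectors attend over graph node vectors and vice versa. It is not the GraphT5 module itself: the abstract does not specify the layer's exact form, so the function names, the toy dimensions, the random inputs, and the absence of learned projections are all assumptions made for illustration. Plain Python lists are used so that the sketch runs without external dependencies.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: token vectors of the attending modality (e.g. SMILES tokens)
    keys, values: token vectors of the attended modality (e.g. graph nodes)
    Returns one output vector per query, plus the attention weights.
    Hypothetical helper: no learned query/key/value projections are applied.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query token to every token of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        w = softmax(scores)
        # Weighted average of the other modality's value vectors.
        out = [sum(wj * v[d] for wj, v in zip(w, values)) for d in range(out_dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (assumed sizes): 5 SMILES tokens and 3 graph nodes, width 4.
rng = random.Random(0)
smiles_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
graph_nodes = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

# Each SMILES token attends over all graph nodes, and each node over all tokens.
smiles_ctx, smiles_to_graph = cross_attention(smiles_tokens, graph_nodes, graph_nodes)
graph_ctx, graph_to_smiles = cross_attention(graph_nodes, smiles_tokens, smiles_tokens)
```

In this sketch every row of `smiles_to_graph` is a probability distribution over graph nodes, so each SMILES token receives a structure-aware summary of the graph. A trained model would typically add learned projections, multiple heads, and residual connections around such a layer.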
