GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention

Sangyeup Kim, Nayeon Kim, Yinhua Piao, Sun Kim

TL;DR

GraphT5 addresses the challenge of integrating 1D SMILES text with 2D molecular graphs for molecular language modeling. It introduces a cross-token attention mechanism that enables token-level interactions between SMILES and graph modalities, and leverages a T5 backbone with a graph encoder pre-trained via GraphMVP. The approach achieves state-of-the-art results on molecule captioning and IUPAC name prediction across PubChem324k and ChEBI-20, with ablations confirming the benefits of multi-modal inputs and cross-modal interactions. This multi-modal fusion enhances the grounding of textual descriptions in molecular structure, offering improved generation quality with potential impact on drug discovery and materials science.

Abstract

Molecular language modeling tasks such as molecule captioning are recognized for their potential to deepen the understanding of molecular properties, which can aid drug discovery or material synthesis based on chemical reactions. Whereas molecular property prediction commonly uses molecule graphs, most methods in molecular language modeling rely heavily on SMILES sequences, because the task involves generating a sequence of multiple tokens with transformer-based models. A main challenge is therefore how to integrate graph data, which contains structural and spatial information about molecules, with text data. In addition, simply feeding both 1D SMILES text and a 2D graph as inputs, without addressing how the two modalities align and represent the molecular structure, makes it difficult to fully utilize structural knowledge about molecules. To this end, we propose GraphT5, a multi-modal framework that integrates 1D SMILES text and 2D graph representations of molecules for molecular language modeling. Specifically, we introduce a novel cross-token attention module in GraphT5 to bridge the gap arising from the fundamental differences between the two modalities of molecule representations. Cross-token attention exploits implicit information between the SMILES and the graph of a molecule that arises from their interactions at a fine-grained token level, which benefits molecular language modeling. Extensive experiments on molecule captioning and IUPAC name prediction, together with case studies, show that GraphT5 outperforms the latest baseline approaches, validating its effectiveness in fully utilizing 1D SMILES text and 2D graph representations.
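
To make the two input modalities concrete, the sketch below shows the same molecule as a 1D SMILES string and as a 2D graph (an atom list plus bond index pairs). The function `linear_smiles_to_graph` is a hypothetical toy helper, not part of GraphT5: it assumes an unbranched chain of single-letter atoms with implicit single bonds. A real pipeline would use a full cheminformatics parser.

```python
from typing import List, Tuple


def linear_smiles_to_graph(smiles: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Toy converter (assumption, not a real SMILES parser).

    Handles only unbranched chains of single-letter atoms joined by
    implicit single bonds, e.g. "CCO". Branches, rings, charges, and
    two-letter elements are rejected.
    """
    atoms = list(smiles)
    if not all(atom in "BCNOPSFI" for atom in atoms):
        raise ValueError("unsupported SMILES for this toy converter")
    # In a linear chain, each atom is bonded to the next one in the string.
    bonds = [(i, i + 1) for i in range(len(atoms) - 1)]
    return atoms, bonds


atoms, bonds = linear_smiles_to_graph("CCO")  # ethanol
print(atoms)  # ['C', 'C', 'O']
print(bonds)  # [(0, 1), (1, 2)]
```

The string carries the atoms as a token sequence, while the graph makes the connectivity explicit. Aligning these two views of one molecule is the problem the abstract describes.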

Paper Structure

This paper contains 26 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Molecule captioning with different input modalities and encoders. (a) 1D SMILES (Simplified Molecular Input Line Entry System; Weininger, 1988) with a language model (e.g., a T5-based model). (b) 1D SMILES and 2D graph as inputs to the SMILES and graph encoders, with a text decoder. (c) 1D SMILES and 2D graph as inputs to the SMILES and graph encoders, with cross-attention between SMILES and graph and a text decoder.
  • Figure 2: Overview of the proposed GraphT5. The 1D SMILES text and 2D graph representations of the given molecule are fed into the SMILES encoder and graph encoder, respectively. The subsequent cross-token attention leverages both representations, so that token-level interaction is reflected in the graph embeddings. After cross-token attention, a residual connection and self-attention are applied. The output graph embedding is summarized into a single vector by a mean-pooling operation. The context vector for encoder-decoder attention in the text decoder is composed of the summarized graph vector, the graph embeddings with cross-token attention applied, and the original SMILES embeddings. The decoder then generates a caption for the given molecule.
  • Figure 3: BLEU-2 and BLEU-4 scores for GraphT5 and a 1D SMILES approach that does not use graphs. The generated captions are evaluated in three groups, divided by the length of the original description, which makes it possible to assess the robustness of the model to description length.
  • Figure 4: The IUPAC name of a molecule reflects its structural characteristics. The highlighted regions of the graph and SMILES representations correspond to the same-colored part of the IUPAC name.
  • Figure 5: Examples of generated results from the molecule captioning task. We compare our GraphT5 result with Text+Chem T5, a 1D SMILES text-based approach, and MolCA, which uses 2D graphs as well as 1D SMILES but lacks cross-token attention. Parts correctly generated by GraphT5 but missing from MolCA are highlighted in red; those missing from Text+Chem T5 are underlined. Since the generated captions are long, some middle parts whose content is mostly shared among the examples are replaced by '(...)'.
  • ...and 1 more figure
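
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits learned projection matrices, multiple heads, and layer normalization; it assumes the graph embeddings act as queries over the SMILES token embeddings; and the names `attention` and `build_context` are hypothetical. Random vectors stand in for the encoder outputs.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used here for the residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mean_pool(m: Matrix) -> List[float]:
    """Average all rows into a single summary vector."""
    return [sum(col) / len(m) for col in zip(*m)]


def build_context(graph_emb: Matrix, smiles_emb: Matrix) -> Matrix:
    # 1. Cross-token attention: each graph node attends to all SMILES tokens
    #    (the query/key direction is an assumption of this sketch).
    cross = attention(graph_emb, smiles_emb, smiles_emb)
    # 2. Residual connection back to the original graph embeddings.
    fused = add(graph_emb, cross)
    # 3. Self-attention over the fused graph embeddings.
    fused = attention(fused, fused, fused)
    # 4. Mean-pool the graph embeddings into one summary vector.
    summary = mean_pool(fused)
    # 5. Context for the decoder: summary vector, fused graph embeddings,
    #    and the original SMILES embeddings, stacked along the sequence axis.
    return [summary] + fused + smiles_emb


random.seed(0)
dim = 8
graph_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]   # 5 atoms
smiles_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(7)]  # 7 tokens
context = build_context(graph_emb, smiles_emb)
print(len(context), len(context[0]))  # 13 8
```

With 5 graph nodes and 7 SMILES tokens, the context holds 1 + 5 + 7 = 13 vectors, each of the shared embedding width. In this sketch the SMILES embeddings pass through unchanged, matching the caption's statement that the original SMILES embeddings are part of the context.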