Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules

Zhiyuan Liu, Yaorui Shi, An Zhang, Enzhi Zhang, Kenji Kawaguchi, Xiang Wang, Tat-Seng Chua

TL;DR

The paper analyzes Masked Graph Modeling (MGM) for molecular representation learning, showing that tokenization and decoding choices critically shape the learned representations. It demonstrates that subgraph-level tokenizers and sufficiently expressive decoders with remask decoding substantially improve encoder quality, addressing a gap left by prior MGM work, which concentrated on graph masking and the encoder. The authors introduce SimSGT, which uses a Simple GNN-based Tokenizer and a GraphTrans-based encoder/decoder with remask-v2, achieving state-of-the-art results on MoleculeNet and related tasks while keeping pretraining efficient. These findings offer a practical path to stronger self-supervised molecular representations and highlight promising directions for joint molecule–text modeling.

Abstract

Masked graph modeling (MGM) excels in the self-supervised representation learning of molecular graphs. Scrutinizing previous studies, we reveal a common scheme consisting of three key components: (1) a graph tokenizer, which breaks a molecular graph into smaller fragments (i.e., subgraphs) and converts them into tokens; (2) graph masking, which corrupts the graph with masks; and (3) a graph autoencoder, which first applies an encoder to the masked graph to generate representations and then employs a decoder on those representations to recover the tokens of the original graph. However, previous MGM studies focus extensively on graph masking and the encoder, while the tokenizer and decoder remain poorly understood. To bridge this gap, we first summarize popular molecule tokenizers at the granularity of node, edge, motif, and Graph Neural Networks (GNNs), and then examine their roles as MGM's reconstruction targets. Further, we explore the potential of adopting an expressive decoder in MGM. Our results show that a subgraph-level tokenizer and a sufficiently expressive decoder with remask decoding have a large impact on the encoder's representation learning. Finally, we propose a novel MGM method, SimSGT, featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding strategy. We empirically validate that our method outperforms existing molecule self-supervised learning methods. Our code and checkpoints are available at https://github.com/syr-cn/SimSGT.
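
The first two components of the scheme above (tokenizer and masking) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the molecule graph is written by hand from the SMILES CC(=O)Nc1cccc(O)c1 (heavy atoms only, bond orders ignored), the tokenizer is a hypothetical parameter-free sum aggregation standing in for a GNN-based tokenizer, and a zero vector stands in for a mask token. The encoder and decoder are omitted.

```python
import random

# Assumption: heavy-atom graph of CC(=O)Nc1cccc(O)c1, written by hand.
# Bond orders, aromaticity, and hydrogens are ignored for simplicity.
ATOMS = ["C", "C", "O", "N", "C", "C", "C", "C", "C", "O", "C"]
BONDS = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6),
         (6, 7), (7, 8), (8, 9), (8, 10), (10, 4)]
ELEMENTS = ["C", "N", "O"]


def neighbors(n_atoms, bonds):
    """Build an undirected adjacency list."""
    adj = [[] for _ in range(n_atoms)]
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def one_hot(atoms):
    """Node-level features: one-hot element type."""
    return [[1.0 if a == e else 0.0 for e in ELEMENTS] for a in atoms]


def tokenize(features, adj, num_layers):
    """Hypothetical parameter-free tokenizer (illustrative only).

    Each layer adds the neighbors' states to the node's own state. A node's
    token concatenates its state after every layer, so it summarizes the
    subtree rooted at that node up to `num_layers` hops.
    """
    tokens = [list(f) for f in features]
    h = features
    for _ in range(num_layers):
        h = [[h[i][d] + sum(h[j][d] for j in adj[i]) for d in range(len(h[i]))]
             for i in range(len(h))]
        for i, state in enumerate(h):
            tokens[i].extend(state)
    return tokens


def mask_nodes(n_atoms, ratio, rng):
    """Pick a random subset of nodes to mask."""
    k = max(1, round(n_atoms * ratio))
    return sorted(rng.sample(range(n_atoms), k))


def corrupt(features, masked):
    """Replace masked node features with zeros (stand-in for a mask token)."""
    masked_set = set(masked)
    return [[0.0] * len(f) if i in masked_set else list(f)
            for i, f in enumerate(features)]


adj = neighbors(len(ATOMS), BONDS)
x = one_hot(ATOMS)
tokens = tokenize(x, adj, num_layers=2)        # reconstruction targets
masked = mask_nodes(len(ATOMS), ratio=0.35, rng=random.Random(0))
x_masked = corrupt(x, masked)                  # input an encoder would see
targets = {i: tokens[i] for i in masked}       # what a decoder would recover

# The methyl carbon (atom 0) and a ring carbon (atom 5) share the same
# node-level feature but receive different subgraph-level tokens.
print(x[0] == x[5], tokens[0] == tokens[5])
```

In this sketch, a node-level target cannot tell the two carbons apart, whereas the subgraph-level token encodes each atom's neighborhood. That is the kind of distinction the abstract attributes to subgraph-level tokenizers.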

Paper Structure

This paper contains 20 sections, 8 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: The pipeline of Masked Graph Modeling.
  • Figure 2: Example of subgraph-level patterns in a molecule. SMILES: CC(=O)Nc1cccc(O)c1.
  • Figure 3: Examples for the first three types of graph tokenizers and their induced subgraphs. (b) A motif-based tokenizer that applies the fragmentation functions of cycles and the remaining nodes. (c) A two-layer GIN-based tokenizer that extracts 2-hop rooted subtrees for every node in the graph.
  • Figure 4: Overview of the SimSGT's framework.
  • Figure 5: GTS encoder. Tokenizer is a single-layer SGT of GIN.
  • ...and 6 more figures