GraphBPE: Molecular Graphs Meet Byte-Pair Encoding

Yuchen Shen, Barnabás Póczos

TL;DR

GraphBPE proposes a data-driven graph tokenization method inspired by Byte-Pair Encoding to decompose molecular graphs into substructures, producing either simple graphs or hypergraphs as inputs to GNNs and HyperGNNs. By iteratively contracting high-frequency contextualized node pairs, GraphBPE creates a vocabulary of graph substructures that can reshape learning, with ring and clique topologies serving as common preprocessing targets. Across six datasets and multiple architectures, tokenization generally benefits small classification tasks and remains competitive with other tokenizers on larger datasets, underscoring the impact of data preprocessing schedules in molecular graph learning. The work demonstrates that a model-agnostic preprocessing step can uncover effective graph representations, suggesting that careful tokenization strategies are a valuable complement to architectural advances in molecular ML, and provides a public implementation to spur further exploration.
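The contraction loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a node's "context" is just its current token label, that each graph is a `(labels, edges)` pair with integer node ids, and that the output is a simple graph (the hypergraph variant is not shown). All function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(graphs):
    """Count label pairs over every edge of every graph; return the top pair or None."""
    counts = Counter()
    for labels, edges in graphs:
        for u, v in edges:
            counts[tuple(sorted((labels[u], labels[v])))] += 1
    if not counts:
        return None
    # Deterministic tie-break: highest count, then lexicographically smallest pair.
    return min(counts, key=lambda p: (-counts[p], p))

def contract(labels, edges, pair):
    """Merge non-overlapping edges whose endpoint labels match `pair` (assumed rule)."""
    parent = {n: n for n in labels}
    used = set()
    new_labels = dict(labels)
    for u, v in sorted(edges):
        if u in used or v in used:
            continue  # each node takes part in at most one merge per step
        if tuple(sorted((labels[u], labels[v]))) == pair:
            used.update((u, v))
            parent[v] = u
            new_labels[u] = "(" + pair[0] + pair[1] + ")"
            del new_labels[v]
    # Re-wire edges to the merged nodes and drop edges that became self-loops.
    new_edges = set()
    for u, v in edges:
        a, b = parent[u], parent[v]
        if a != b:
            new_edges.add((min(a, b), max(a, b)))
    return new_labels, new_edges

def graph_bpe(graphs, num_steps):
    """Run up to `num_steps` contraction rounds; return tokenized graphs and the vocabulary."""
    vocab = []
    for _ in range(num_steps):
        pair = most_frequent_pair(graphs)
        if pair is None:
            break  # no edges left to contract
        vocab.append(pair)
        graphs = [contract(labels, edges, pair) for labels, edges in graphs]
    return graphs, vocab

# Toy corpus: a C-C-O chain and a single C-O bond.
toy = [
    ({0: "C", 1: "C", 2: "O"}, {(0, 1), (1, 2)}),
    ({0: "C", 1: "O"}, {(0, 1)}),
]
tokenized, vocab = graph_bpe(toy, num_steps=1)
print(vocab)      # [('C', 'O')]
print(tokenized)  # [({0: 'C', 1: '(CO)'}, {(0, 1)}), ({0: '(CO)'}, set())]
```

In this toy corpus the C-O pair occurs twice and C-C once, so the first step contracts every C-O bond into a single `(CO)` node. The number of steps plays the role of the tokenization-step axis in the figures.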

Abstract

With the increasing attention to molecular machine learning, various innovations have been made in designing better models or proposing more comprehensive benchmarks. However, the data preprocessing schedule for molecular graphs is less studied, even though a different view of the molecular graph could potentially boost a model's performance. Inspired by the Byte-Pair Encoding (BPE) algorithm, a subword tokenization method widely adopted in Natural Language Processing, we propose GraphBPE, which tokenizes a molecular graph into different substructures and acts as a preprocessing schedule independent of the model architecture. Our experiments on 3 graph-level classification and 3 graph-level regression datasets show that data preprocessing could boost the performance of models for molecular graphs, and that GraphBPE is effective for small classification datasets and performs on par with other tokenization methods across different model architectures.

Paper Structure (23 sections, 1 equation, 46 figures, 17 tables, 3 algorithms)

Figures (46)

  • Figure 1: Tokenization of a molecule from Mutag whose SMILES string is "c1cc(c(cc1F)[N+](=O)[O-])F". The identified node sets are colored at iterations $t=0, 1, 43$.
  • Figure 2: Results of a 3-layer GCN, GAT, GIN, and GraphSAGE with a learning rate of 0.01 and a hidden size of 32 on Mutag, Enzymes, and Proteins (1st, 2nd, and 3rd rows, respectively); higher accuracy is better. The x-axis denotes the number of tokenization steps in our GraphBPE algorithm. We plot $\mu\pm\sigma$ over 5 runs for each configuration.
  • Figure 3: Results of a 3-layer HyperConv, HGNN++, and HNHN (1st, 2nd, and 3rd rows, respectively) with a learning rate of {0.01, 0.001} and a hidden size of {32, 64} on Freesolv; lower RMSE is better. The x-axis denotes the number of tokenization steps in our GraphBPE algorithm. We plot $\mu\pm\sigma$ over 5 runs for GraphBPE and Centroid, and omit $\pm\sigma$ for Chem and H2g for better visualization.
  • Figure 4: Results of GCN on Mutag; higher accuracy is better.
  • Figure 5: Results of GAT on Mutag; higher accuracy is better.
  • ...and 41 more figures