GraphBPE: Molecular Graphs Meet Byte-Pair Encoding
Yuchen Shen, Barnabás Póczos
TL;DR
GraphBPE is a data-driven graph tokenization method, inspired by Byte-Pair Encoding, that decomposes molecular graphs into substructures and produces either simple graphs or hypergraphs as inputs to GNNs and HyperGNNs. By iteratively contracting high-frequency contextualized node pairs, GraphBPE builds a vocabulary of graph substructures that gives the model a different view of each molecule, with ring and clique topologies serving as common preprocessing targets. Across six datasets (three graph-level classification, three graph-level regression) and multiple architectures, tokenization generally benefits small classification datasets and remains competitive with other tokenizers on larger ones, underscoring the impact of the data preprocessing schedule in molecular graph learning. The work shows that a model-agnostic preprocessing step can yield effective graph representations, suggests that careful tokenization is a valuable complement to architectural advances in molecular ML, and provides a public implementation to spur further exploration.
Abstract
With the increasing attention to molecular machine learning, various innovations have been made in designing better models and proposing more comprehensive benchmarks. However, the data preprocessing schedule for molecular graphs has been studied less, even though a different view of the molecular graph could potentially boost a model's performance. Inspired by the Byte-Pair Encoding (BPE) algorithm, a subword tokenization method widely adopted in Natural Language Processing, we propose GraphBPE, which tokenizes a molecular graph into different substructures and acts as a preprocessing schedule independent of the model architecture. Our experiments on three graph-level classification and three graph-level regression datasets show that data preprocessing can boost the performance of models for molecular graphs. GraphBPE is effective for small classification datasets and performs on par with other tokenization methods across different model architectures.
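To make the BPE analogy concrete, here is a minimal sketch of one way pair contraction on labeled graphs could work. It is not the authors' implementation. The graph representation (a dict of node labels plus a set of undirected edges), the function names, and the simplification of a "node pair" to just the two endpoint labels are all assumptions made for illustration. In particular, the paper's notion of a contextualized pair may carry more information than this.

```python
from collections import Counter

# Hypothetical representation: a graph is (labels, edges), where labels maps
# node id -> string and edges is a set of (u, v) tuples with u < v.

def pair_key(a, b):
    """Order-independent key for the labels at the two ends of an edge."""
    return tuple(sorted((a, b)))

def most_frequent_pair(graphs):
    """Count endpoint-label pairs over all edges of all graphs."""
    counts = Counter()
    for labels, edges in graphs:
        for u, v in edges:
            counts[pair_key(labels[u], labels[v])] += 1
    if not counts:
        return None
    # Ties are broken toward the lexicographically smallest pair.
    return max(sorted(counts), key=counts.get)

def contract_pair(labels, edges, pair):
    """Merge non-overlapping edges whose endpoint labels match `pair`."""
    merged_into = {}            # absorbed node -> surviving node
    used = set()                # nodes already consumed in this round
    new_labels = dict(labels)
    for u, v in sorted(edges):
        if u in used or v in used:
            continue
        if pair_key(labels[u], labels[v]) == pair:
            used.update((u, v))
            merged_into[v] = u
            new_labels[u] = "(" + "-".join(pair) + ")"
            del new_labels[v]
    new_edges = set()
    for u, v in edges:
        a, b = merged_into.get(u, u), merged_into.get(v, v)
        if a != b:              # drop the edge that was contracted
            new_edges.add((min(a, b), max(a, b)))
    return new_labels, new_edges

def bpe_style_rounds(graphs, num_rounds):
    """Repeat 'find the top pair, contract it everywhere' num_rounds times."""
    vocab = []
    for _ in range(num_rounds):
        pair = most_frequent_pair(graphs)
        if pair is None:
            break
        vocab.append(pair)
        graphs = [contract_pair(l, e, pair) for l, e in graphs]
    return graphs, vocab

# Toy heavy-atom skeletons: ethanol (C-C-O) and propane (C-C-C).
toy = [
    ({0: "C", 1: "C", 2: "O"}, {(0, 1), (1, 2)}),
    ({0: "C", 1: "C", 2: "C"}, {(0, 1), (1, 2)}),
]
tokenized, vocab = bpe_style_rounds(toy, num_rounds=1)
```

In this toy run the C-C pair is the most frequent, so each molecule ends up with one merged "(C-C)" node attached to its remaining atom. Each merged node records which labels it absorbed, so the same bookkeeping could in principle be used to emit either a coarsened simple graph or a hyperedge over the original atoms.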
