A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer
Zhangyang Gao, Daize Dong, Cheng Tan, Jun Xia, Bozhen Hu, Stan Z. Li
TL;DR
GraphsGPT introduces a pure-Transformer pipeline that converts Non-Euclidean graphs into a fixed-length Euclidean sequence of Graph Words via Graph2Seq and then reconstructs the original graph with GraphGPT. Edge-centric generation and decoupled Graph Position Encodings enable end-to-end representation and generation, trained with a GPT-style self-supervised objective on ~100M molecules. The pretrained encoder achieves state-of-the-art representation-learning performance on multiple MoleculeNet tasks, and the pretrained decoder supports few-shot and controllable graph generation as well as graph mixup in Euclidean space. The framework is robust to permutation and opens a new paradigm for mapping graph data to and from Euclidean latent spaces for manipulation and optimization.
Abstract
Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property has posed a long-term challenge in graph modeling. Despite recent efforts by graph neural networks and graph transformers to encode graphs as Euclidean vectors, recovering the original graph from those vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable Graph Words in the Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from Graph Words to ensure information equivalence. We pretrain GraphsGPT on $100$M molecules and report four findings: (1) The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on $8/9$ graph classification and regression tasks. (2) The pretrained GraphGPT serves as a strong graph generator, as demonstrated by its performance on both few-shot and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges. (4) The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation. Code is available at https://github.com/A4Bio/GraphsGPT.
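To make finding (3) concrete, the following is a minimal sketch of what mixup in the Euclidean space can look like once two graphs are represented as Graph Word sequences of the same shape. It is not the released GraphsGPT code: the function name `mixup_graph_words`, the plain-list representation, and the toy values (with $K = 2$ words of dimension 3) are assumptions made for illustration, and the Graph2Seq encoder and GraphGPT decoder are not reproduced here.

```python
from typing import List

Vector = List[float]
GraphWords = List[Vector]  # K vectors, each of dimension d


def mixup_graph_words(a: GraphWords, b: GraphWords, lam: float) -> GraphWords:
    """Linearly interpolate two Graph Word sequences of identical shape.

    Hypothetical helper: returns lam * a + (1 - lam) * b, element by element.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(a) != len(b) or any(len(u) != len(v) for u, v in zip(a, b)):
        raise ValueError("both sequences must share the same shape (K, d)")
    return [
        [lam * x + (1.0 - lam) * y for x, y in zip(u, v)]
        for u, v in zip(a, b)
    ]


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
words_a = [[1.0, 0.0, 2.0], [0.0, 4.0, 0.0]]
words_b = [[3.0, 2.0, 0.0], [2.0, 0.0, 2.0]]

mixed = mixup_graph_words(words_a, words_b, 0.5)
print(mixed)
```

Because every graph maps to a sequence of the same fixed length, this interpolation needs no node alignment between the two graphs; in the full pipeline the mixed sequence would then be passed to the decoder to obtain a graph.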
