Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

Haozhuo Zheng, Cheng Wang, Yang Liu

TL;DR

GVT is a two-stage molecular generator. It first learns high-fidelity discrete latent representations of molecular graphs with a Graph VQ-VAE, then trains an autoregressive Transformer on these latents to generate new molecules. The VQ-VAE combines canonical Reverse Cuthill-McKee (RCM) node ordering with Rotary Positional Embeddings (RoPE) in the decoder to achieve near-perfect graph reconstruction, addressing structural ambiguity. On QM9, ZINC250k, MOSES, and GuacaMol, GVT attains state-of-the-art or competitive performance, with strong distribution-similarity metrics (FCD, KL) and significantly faster generation than diffusion methods. By reframing graph generation as discrete latent sequence modeling, GVT offers a scalable, efficient alternative that aligns molecular design with large-scale language-model paradigms and sets a strong baseline for future discrete-latent molecular generation.

Abstract

The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.
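
The canonical node ordering mentioned in the abstract can be illustrated with a small pure-Python sketch of Reverse Cuthill-McKee. This is not the authors' implementation: the function names `rcm_order` and `bandwidth`, the edge-list input format, and the tie-breaking by node index are all assumptions made for illustration. The sketch runs a breadth-first search from the lowest-degree node, visits neighbors in order of increasing degree, and reverses the result, which tends to place bonded atoms close together in the sequence.

```python
from collections import deque


def rcm_order(num_nodes, edges):
    """Return a Reverse Cuthill-McKee node ordering for an undirected graph.

    Ties between nodes of equal degree are broken by node index (an
    illustrative assumption, not something stated in the source).
    """
    neighbors = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = [len(n) for n in neighbors]

    visited = [False] * num_nodes
    order = []
    # Handle each connected component, starting from its lowest-degree node.
    for start in sorted(range(num_nodes), key=lambda n: (degree[n], n)):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Enqueue unvisited neighbors in order of increasing degree.
            for nxt in sorted(neighbors[node], key=lambda n: (degree[n], n)):
                if not visited[nxt]:
                    visited[nxt] = True
                    queue.append(nxt)
    order.reverse()  # the "reverse" step of Reverse Cuthill-McKee
    return order


def bandwidth(edges, order):
    """Largest distance in the ordering between two bonded nodes."""
    position = {node: i for i, node in enumerate(order)}
    return max((abs(position[u] - position[v]) for u, v in edges), default=0)


# A 5-atom chain whose labels are scrambled relative to its structure.
edges = [(0, 3), (3, 1), (1, 4), (4, 2)]
order = rcm_order(5, edges)
print(order)  # [2, 4, 1, 3, 0]
print(bandwidth(edges, range(5)), bandwidth(edges, order))  # 3 1
```

In this toy chain the reordering brings every bonded pair to adjacent positions, which is the property that lets a sequence model treat proximity in the sequence as a proxy for proximity in the graph.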

Paper Structure

This paper contains 20 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An overview of our proposed two-stage framework. Stage 1: Graph VQ-VAE. A molecular graph is first preprocessed with Reverse Cuthill-McKee (RCM) for canonical node ordering. The Graph Transformer-based Encoder maps the graph to continuous latent vectors, which are then quantized into a sequence of discrete codes by the Vector Quantization (VQ) layer. The Decoder, which uniquely uses RoPE to interpret sequential proximity as structural information, reconstructs the graph from these codes. The model is trained end-to-end via a reconstruction and commitment loss. Stage 2: Autoregressive Generation. The trained VQ-VAE is used to encode a dataset of molecules into discrete latent sequences. A decoder-only autoregressive Transformer is then trained on these sequences. New molecules are generated by sampling a latent sequence from the AR model and decoding it back into a graph using the pre-trained VQ-VAE decoder.
  • Figure 2: An example of structural ambiguity. Three distinct oxygen atoms are encoded into the same discrete latent code ($k=211$).
  • Figure 3: 0-Error Reconstruction Rate (%) on test sets. Our full model (GVT with RoPE) achieves near-perfect reconstruction, drastically outperforming both the previous DGAE's GAE and our own architecture without the crucial RoPE component, especially on complex datasets.
  • Figure 4: Comparison of generation time for sampling 10,000 molecules on the QM9 dataset. All models were benchmarked on an NVIDIA RTX 4090 GPU. The x-axis is on a logarithmic scale to better visualize the wide range of speeds. Our model shows a competitive generation time and is significantly faster than the diffusion-based methods.
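
The Figure 1 caption says the decoder uses RoPE to interpret sequential proximity as structural information. A minimal sketch of the commonly used RoPE rotation shows why that works: after rotation, the dot product between a query and a key depends only on the offset between their positions. The frequency base of 10000, the pairing of consecutive dimensions, and the names `rope` and `attention_score` are assumptions taken from the standard formulation, not details confirmed by the source.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def attention_score(query, key, query_pos, key_pos):
    """Unnormalized attention logit between a rotated query and key."""
    q = rope(query, query_pos)
    k = rope(key, key_pos)
    return sum(a * b for a, b in zip(q, k))


query = [1.0, 0.0, 0.5, -1.0]
key = [0.3, 0.7, -0.2, 0.4]

# Same offset (2) at different absolute positions gives the same score.
near_start = attention_score(query, key, 3, 1)
far_along = attention_score(query, key, 10, 8)
print(math.isclose(near_start, far_along, abs_tol=1e-9))  # True
```

Combined with an ordering that keeps bonded atoms near each other in the sequence, this relative-offset behavior gives the decoder a positional signal that correlates with graph structure.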