Discrete Graph Auto-Encoder

Yoann Boget, Magda Gregorova, Alexandros Kalousis

TL;DR

This paper addresses graph generation when no canonical node ordering exists. It introduces the Discrete Graph Auto-Encoder (DGAE), which first maps graphs to sets of discrete node embeddings with a permutation-equivariant encoder, then sorts these sets into sequences and models their distribution with a 2D autoregressive Transformer. The two-stage framework uses feature augmentation with $p$-path features and partitioned vector quantization to create a discrete latent space with known support, which enables efficient learning of the latent distribution. Experiments on simple graphs and molecular datasets show performance competitive with the state of the art on distributional metrics (e.g., NSPDK, FCD) and substantial generation speed gains over baselines, with ablations validating the benefits of the proposed augmentations and codebook design. The work contributes a novel combination of graph-to-set encoding, discrete latent modeling, and a two-dimensional Transformer for graph generation, offering a scalable and effective path for generic graph synthesis beyond domain-specific representations.

Abstract

Despite advances in generative methods, accurately modeling the distribution of graphs remains a challenging task, primarily because graphs have no predefined or inherent unique representation. Two main strategies have emerged to tackle this issue: 1) restricting the number of possible representations by sorting the nodes, or 2) using permutation-invariant/equivariant functions, specifically Graph Neural Networks (GNNs). In this paper, we introduce a new framework named Discrete Graph Auto-Encoder (DGAE), which leverages the strengths of both strategies and mitigates their respective limitations. In essence, we propose a two-step strategy. In the first step, we use a permutation-equivariant auto-encoder to convert graphs into sets of discrete latent node representations, each node being represented by a sequence of quantized vectors. In the second step, we sort the sets of discrete latent representations and learn their distribution with a specifically designed auto-regressive model based on the Transformer architecture. Through multiple experimental evaluations, we demonstrate that our model performs competitively with the existing state of the art across various datasets. Several ablation studies support the value of our method.
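
The second step relies on the fact that sorting a set of per-node codeword sequences yields one sequence regardless of the original node order. The following is a minimal sketch of that idea only, not the authors' implementation: the lexicographic ordering, the function name `canonical_sequence`, and the toy codes are assumptions made for illustration.

```python
from itertools import permutations


def canonical_sequence(node_codes):
    """Sort per-node codeword index tuples lexicographically.

    Assumption: lexicographic order is used here purely for illustration;
    the abstract does not state which ordering is applied.
    """
    return sorted(tuple(code) for code in node_codes)


# Hypothetical example: 3 nodes, each represented by 2 codebook indices.
node_codes = [(2, 0), (0, 1), (2, 1)]
reference = canonical_sequence(node_codes)

# Every node permutation of the same set maps to the same sorted sequence,
# so a sequence model sees one representation per set of latent codes.
same_for_all_orderings = all(
    canonical_sequence(list(p)) == reference for p in permutations(node_codes)
)
```

Because the sorted sequence is identical for every node ordering, an autoregressive model trained on it does not have to account for the many equivalent permutations of the same latent set.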

Paper Structure (70 sections, 18 equations, 13 figures, 16 tables)

Figures (13)

  • Figure 1: Diagram of our auto-encoder. 1. The encoder is an MPNN transforming the graph into a set of node embeddings $\mathcal{Z}^h$. 2. The elements of the set $\mathcal{Z}^h$ are partitioned and quantized, producing a set of codeword sequences $\mathcal{Z}^q$. 3. The decoder, another MPNN, takes the set $\mathcal{Z}^q$ and reconstructs the original graph.
  • Figure 2: Diagram of the quantization. We represent each node embedding by $C$ partition vectors ${\bm{z}}^h_{i, c}$. Then, we quantize each of these vectors by replacing them with their closest neighbor from the corresponding codebook $H_c$. The vectors in the codebooks are parameters learned during training.
  • Figure 3: The lines represent the average over three runs and the shaded area the standard deviation.
  • Figure 4: Effect of the codebook size and the partitioning on the dictionary usage. We report the normalized perplexity averaged over three runs. The black lines indicate the standard deviations.
  • Figure 5: Effect of the codebook size and the partitioning on reconstruction (left) and generation (right). We report the best reconstruction loss and the best NSPDK averaged over 3 runs. The black lines indicate the standard deviations.
  • ...and 8 more figures
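
The quantization in Figure 2 and the dictionary-usage measure in Figure 4 can be illustrated with a short sketch. It assumes equal-size partitions, squared Euclidean distance for the nearest-codeword search, and normalized perplexity defined as the exponential of the entropy of empirical code usage divided by the codebook size; none of these details are stated in the captions, and the function names and toy codebooks are hypothetical.

```python
import math
from collections import Counter


def quantize(z, codebooks):
    """Split embedding z into C equal partitions and snap each partition
    to its nearest codeword in the matching codebook H_c.

    Assumptions: equal-size partitions and squared Euclidean distance.
    Returns (codeword indices, quantized embedding).
    """
    num_parts = len(codebooks)
    part_dim = len(z) // num_parts
    indices, quantized = [], []
    for c, codebook in enumerate(codebooks):
        part = z[c * part_dim:(c + 1) * part_dim]
        # Index of the closest codeword in codebook c.
        nearest = min(
            range(len(codebook)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(part, codebook[j])),
        )
        indices.append(nearest)
        quantized.extend(codebook[nearest])
    return indices, quantized


def normalized_perplexity(indices, codebook_size):
    """exp(entropy of empirical code usage) / codebook size (assumed definition)."""
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((m / total) * math.log(m / total) for m in counts.values())
    return math.exp(entropy) / codebook_size


# Toy example: C = 2 codebooks, each holding 2 codewords of dimension 2.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
indices, z_q = quantize([0.9, 0.8, 0.1, 0.7], codebooks)
```

Under this assumed definition, a value of 1 means every codeword is used equally often, while a value close to the reciprocal of the codebook size means usage has collapsed onto a single codeword.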