Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Yancheng Wang, Changyu Liu, Yingzhen Yang

TL;DR

Diffusion on Graph (DoG) augments node-level learning by generating synthetic nodes and their associated edges within a single graph. It combines a Graph Autoencoder (GAE) with a Latent Diffusion Model (LDM) trained via Classifier-Free Guidance (CFG) in latent space, and uses a Bi-Level Neighbor Map Decoder (BLND) to reconstruct edges efficiently; the generated structures are added to the original graph to form an augmented graph. To counter the noise introduced by the synthetic structures, DoG adds a low-rank regularization term based on a truncated nuclear norm, supported by a theoretical guarantee on the test loss. The paper reports improvements on node classification and graph contrastive learning across multiple benchmarks, including large-scale graphs. The method is designed to be orthogonal to existing node-level augmentation techniques and is accompanied by an open-source implementation.

Abstract

Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at https://github.com/Statistical-Deep-Learning/DoG.
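
The low-rank regularization mentioned above is described in the TL;DR as being based on a truncated nuclear norm. The paper's exact definition is not reproduced in this summary, so the following is only a minimal sketch under stated assumptions: the kernel gram matrix is taken to be the linear kernel $\mathbf{K} = \mathbf{F}\mathbf{F}^{\top}$ of a feature matrix $\mathbf{F}$, and the truncated nuclear norm is taken to be the sum of the eigenvalues of $\mathbf{K}$ after the largest $r_0$. The function name `truncated_nuclear_norm` is hypothetical and does not come from the DoG code base.

```python
import numpy as np


def truncated_nuclear_norm(features: np.ndarray, r0: int) -> float:
    """Sum of the eigenvalues of K = F F^T beyond the r0 largest ones.

    Assumptions (not confirmed by the source): a linear kernel is used,
    and "truncated" means the top-r0 eigenvalues are left unpenalized.
    """
    if r0 < 0:
        raise ValueError("r0 must be non-negative")
    # Singular values of F are returned in descending order.
    singular_values = np.linalg.svd(features, compute_uv=False)
    # The nonzero eigenvalues of F F^T are the squared singular values of F.
    eigenvalues = singular_values ** 2
    # Penalize only the tail of the spectrum.
    return float(eigenvalues[r0:].sum())
```

Under these assumptions, a small value means that most of the spectrum of $\mathbf{K}$ is concentrated in its top $r_0$ eigenvalues, which is the sense in which such a penalty would encourage low-rank features.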

Paper Structure

This paper contains 33 sections, 1 theorem, 9 equations, 6 figures, 13 tables, 2 algorithms.

Key Result

Theorem A.1

Let $m \ge cN$ for a constant $c\in (0,1)$, and $r_0 \in [N]$. Assume that a set ${\cal V}_L \subseteq {\cal V}$ with $\left | {\cal V}_L \right | = m$ is sampled uniformly without replacement from ${\cal V}$ as the labeled training nodes, and the remaining nodes ${\cal V}_U = {\cal V} \setmin

Figures (6)

  • Figure 1: Illustration of the augmented graph after adding the synthetic graph structures to the original graph.
  • Figure 2: Node classification accuracy of GCN on Cora and Citeseer trained with different numbers ($N'$) of synthetic nodes added. ${\cal V}_L$ is the set of labeled training nodes in the original graph.
  • Figure 3: The overall framework of the synthetic node generation process. The structure of the Bi-Level Neighbor Map Decoder (BLND) is illustrated in Figure 4.
  • Figure 4: The structure of the Bi-Level Neighbor Map Decoder (BLND), which generates an inter-cluster neighbor map and an intra-cluster neighbor map for each node.
  • Figure 5: Eigen-projection (first row) and signal concentration ratio (second row) on the augmented graph for Cora, Citeseer, Pubmed, Coauthor-CS, and ogbn-arxiv. To compute the eigen-projection, we first calculate the eigenvectors $\mathbf{U}$ of the kernel gram matrix $\mathbf{K} \in \mathbb{R}^{\bar{N} \times \bar{N}}$ computed from a feature matrix $\mathbf{F} \in \mathbb{R}^{\bar{N} \times d}$; the projection value is then computed as $\mathbf{p} = \frac{1}{C}\sum_{c=1}^{C} {\mathbf{U}}^{\top} \mathbf{Y}^{(c)}/ {\left\| \mathbf{Y}^{(c)}\right\|}_{2}^2 \in \mathbb{R}^n$, where $C$ is the number of classes, $\mathbf{Y}\in\{0,1\}^{\bar{N} \times C}$ contains the one-hot labels of all the training data in the augmented graph, and $\mathbf{Y}^{(c)}$ is the $c$-th column of $\mathbf{Y}$. The eigen-projection $\mathbf{p}_{r}$ for $r \in [\min(\bar{N},d)]$ reflects the amount of the signal projected onto the $r$-th eigenvector of $\mathbf{K}$, and the signal concentration ratio at rank $r$ reflects the proportion of the signal projected onto the top $r$ eigenvectors of $\mathbf{K}$. The signal concentration ratio for rank $r$ is computed as ${\left\|\mathbf{p}^{(1:r)}\right\|}_{2}$, where $\mathbf{p}^{(1:r)}$ contains the first $r$ elements of $\mathbf{p}$. For example, at rank $r=0.2\min\left\{\bar{N},d\right\}$, the signal concentration ratios of $\mathbf{Y}$ for Cora, Citeseer, Pubmed, Coauthor-CS, and ogbn-arxiv are $0.844$, $0.809$, $0.784$, $0.779$, and $0.787$, respectively. We refer to this property as the low frequency property. It suggests that we can learn a low-rank portion of the observed label $\mathbf{Y}$ that covers most of the information in the ground truth clean label while learning only a small portion of the label noise.
  • ...and 1 more figure
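
The eigen-projection and signal concentration ratio defined in the Figure 5 caption can be sketched in a few lines of NumPy. This is a direct transcription of the caption's two formulas, not the authors' implementation. It assumes a linear kernel $\mathbf{K} = \mathbf{F}\mathbf{F}^{\top}$ (the caption does not state which kernel is used), orders eigenvectors by descending eigenvalue, and assumes every class has at least one labeled node. The function names are hypothetical, and any further normalization the authors apply to obtain the reported ratios is not shown in this summary.

```python
import numpy as np


def eigen_projection(features: np.ndarray, labels: np.ndarray,
                     num_classes: int) -> np.ndarray:
    """p = (1/C) * sum_c U^T Y^(c) / ||Y^(c)||_2^2, as in the caption."""
    # Assumption: linear kernel gram matrix K = F F^T.
    K = features @ features.T
    # eigh returns eigenvalues in ascending order; reorder so that the
    # eigenvector of the largest eigenvalue comes first.
    eigenvalues, U = np.linalg.eigh(K)
    U = U[:, np.argsort(eigenvalues)[::-1]]
    # One-hot label matrix Y of shape (N, C).
    Y = np.eye(num_classes)[labels]
    # Squared L2 norm of each column Y^(c); for one-hot labels this is the
    # class count, so every class must appear at least once.
    column_sq_norms = (Y ** 2).sum(axis=0)
    # Project each normalized label column onto the eigenvectors and
    # average over the C classes.
    return (U.T @ Y / column_sq_norms).mean(axis=1)


def signal_concentration_ratio(p: np.ndarray, r: int) -> float:
    """L2 norm of the first r entries of p, as in the caption."""
    return float(np.linalg.norm(p[:r]))
```

Because the ratio is the norm of a growing prefix of $\mathbf{p}$, it is non-decreasing in $r$; plotting it against $r$ shows how quickly the label signal accumulates along the leading eigenvectors.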

Theorems & Definitions (2)

  • Theorem A.1
  • proof