Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning

Xinyi Gao, Yayong Li, Tong Chen, Guanhua Ye, Wentao Zhang, Hongzhi Yin

TL;DR

This work tackles graph condensation under label scarcity by introducing Contrastive Graph Condensation (CTGC), a self-supervised framework that disentangles semantic and structural information via a dual-branch relay architecture. The semantic branch processes node attributes while the structural branch encodes geometric information through spectral embeddings using EigenMLP, with an alternating optimization that aligns both branches through clustering-based contrastive losses. Graph generation is achieved through a model-inversion process, recovering a condensed topology and node attributes from the learned centroids, enabling label-free pre-training for diverse downstream tasks. Empirical results on multiple datasets show CTGC consistently outperforms state-of-the-art GC methods, especially at high condensation ratios, and demonstrates strong generalization across GNN architectures and tasks such as node classification, link prediction, and clustering.

Abstract

With the increasing computational cost of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising way to synthesize a compact substitute for the large-scale original graph, enabling efficient GNN training. However, existing GC methods predominantly use classification as the surrogate task for optimization, so they rely heavily on node labels, which limits their utility in label-sparse scenarios. More critically, this surrogate task tends to overfit class-specific information into the condensed graph, restricting how well GC generalizes to other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of node attributes and graph structures, where a dedicated structural branch explicitly encodes geometric information through nodes' positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC handles various downstream tasks well with a limited number of labels, consistently outperforming state-of-the-art GC methods.
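
The positional embeddings mentioned above are described elsewhere on this page as spectral, with the structural relay model taking eigenvalues and eigenvectors as inputs. The sketch below shows one common way such inputs could be obtained, and it is not the authors' implementation: it assumes the symmetric normalized Laplacian as the matrix being decomposed, the function name `laplacian_positional_embeddings` is hypothetical, and the EigenMLP that would consume these inputs is not reproduced here.

```python
import numpy as np


def laplacian_positional_embeddings(adj, k):
    """Return the k smallest eigenvalues and their eigenvectors.

    Assumption: the decomposed matrix is the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2}; the source does not state
    which graph matrix CTGC uses.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)

    # Guard against isolated nodes (degree 0) when inverting degrees.
    inv_sqrt_deg = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt_deg[nonzero] = deg[nonzero] ** -0.5

    lap = np.eye(adj.shape[0]) - inv_sqrt_deg[:, None] * adj * inv_sqrt_deg[None, :]

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals[:k], eigvecs[:, :k]


# Toy graph (illustrative only): two triangles joined by a single edge.
toy_adj = np.array(
    [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ],
    dtype=float,
)
toy_vals, toy_vecs = laplacian_positional_embeddings(toy_adj, k=3)
print(toy_vals)
print(toy_vecs)
```

Each row of the returned eigenvector matrix can be read as a `k`-dimensional positional embedding for one node. A dense eigendecomposition like this only suits small graphs; for large graphs a sparse solver that computes just a few eigenpairs would be the more plausible choice.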

Paper Structure

This paper contains 26 sections, 16 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The framework of our proposed CTGC, which comprises two stages: relay model training and graph generation. (1) CTGC employs a dual-branch architecture to separately extract semantic and structural information. The semantic relay model processes both the graph structure and node attributes, while the structural relay model uses eigenvalues and eigenvectors as inputs. These branches are iteratively optimized using contrastive losses. (2) The condensed graph is generated using the model inversion technique. This process begins with generating eigenvectors to construct the condensed graph structure, followed by learning node attributes based on the constructed graph structure.
  • Figure 2: The effect of the alternating optimization on Reddit ($r$=0.1%). Left: task performance. Right: accuracy of assignment alignment.
  • Figure 3: Node distribution of original and condensed graphs. Condensed nodes (black stars) are label-free, while original nodes are color-coded by class labels.
  • Figure 4: The effect of the hyper-parameter $\alpha$ on Cora ($r$=2.6%) and Reddit ($r$=0.1%) under the 3-shot setting.