Spectral Greedy Coresets for Graph Neural Networks

Mucong Ding; Yinhan He; Jundong Li; Furong Huang

TL;DR

This paper tackles training efficiency for Graph Neural Networks on large-scale graphs by introducing Spectral Greedy Graph Coresets (SGGC), which select ego-graphs around center nodes in the graph spectral domain to approximate full-graph training loss. SGGC uses a two-stage greedy framework: a coarse spreading step via GIGA to approximate the spectral embedding and a refinement step via submodular maximization (CRAIG) to diversify topology; diffusion ego-graphs and PCA compression enable scalable coreset construction without pre-training. The authors provide a theoretical bound on the node-classification loss under a bounded-spectral-variance assumption and demonstrate strong empirical performance on ten datasets, including very large graphs, often outperforming model-based coresets and graph condensation while running faster and generalizing across GNN architectures. This work enables scalable, architecture-agnostic data condensation for GNNs, reducing training time and memory with minimal loss in accuracy, though it assumes smooth spectral embeddings and has $O(c n_t n)$ time complexity in the coreset construction.
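As a rough illustration of the refinement stage, the sketch below greedily maximizes a facility-location objective, the kind of submodular function that CRAIG-style selection is built on. It is not the authors' implementation: the function name `greedy_facility_location`, the toy embeddings, and the similarity (maximum distance minus pairwise distance) are all assumptions made for this example.

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedily pick `budget` rows maximizing sum_j max_{i in S} similarity[i, j].

    `similarity` is assumed to be a non-negative (n, n) matrix.
    """
    n = similarity.shape[0]
    selected = []
    taken = np.zeros(n, dtype=bool)
    best_cover = np.zeros(n)  # best similarity of each item to the selected set
    for _ in range(budget):
        # Marginal gain in the objective from adding each candidate row.
        gains = np.maximum(similarity, best_cover[None, :]).sum(axis=1) - best_cover.sum()
        gains[taken] = -np.inf  # never pick the same item twice
        pick = int(np.argmax(gains))
        selected.append(pick)
        taken[pick] = True
        best_cover = np.maximum(best_cover, similarity[pick])
    return selected

# Toy embeddings (hypothetical): two well-separated clusters of three points each.
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
similarity = dists.max() - dists  # non-negative, larger means closer
picks = greedy_facility_location(similarity, budget=2)
```

With a budget of two, the greedy rule takes one representative from each cluster, which is the diversifying behavior the refinement step relies on.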

Abstract

The ubiquity of large-scale graphs in node-classification tasks significantly hinders the real-world applications of Graph Neural Networks (GNNs). Node sampling, graph coarsening, and dataset condensation are effective strategies for enhancing data efficiency. However, owing to the interdependence of graph nodes, coreset selection, which selects subsets of the data examples, has not been successfully applied to speed up GNN training on large graphs, warranting special treatment. This paper studies graph coresets for GNNs and avoids the interdependence issue by selecting ego-graphs (i.e., neighborhood subgraphs around a node) based on their spectral embeddings. We decompose the coreset selection problem for GNNs into two phases: a coarse selection of widely spread ego graphs and a refined selection to diversify their topologies. We design a greedy algorithm that approximately optimizes both objectives. Our spectral greedy graph coreset (SGGC) scales to graphs with millions of nodes, obviates the need for model pre-training, and applies to low-homophily graphs. Extensive experiments on ten datasets demonstrate that SGGC outperforms other coreset methods by a wide margin, generalizes well across GNN architectures, and is much faster than graph condensation.
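To make the notion of an ego-graph concrete, here is a minimal sketch that collects the nodes within a fixed number of hops of a center node using breadth-first search. It shows a plain k-hop neighborhood only, not the diffusion ego-graphs the paper uses for scalability; the function name `ego_graph_nodes` and the adjacency-list representation are assumptions for illustration.

```python
from collections import deque

def ego_graph_nodes(adjacency, center, hops):
    """Return the set of nodes within `hops` edges of `center`."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 as an adjacency list.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_hop = ego_graph_nodes(path_graph, center=2, hops=1)
two_hop = ego_graph_nodes(path_graph, center=0, hops=2)
```

Selecting whole neighborhoods of this kind, rather than isolated nodes, is what lets a coreset keep the local structure that message passing depends on.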

Paper Structure (23 sections, 7 theorems, 18 equations, 6 figures, 17 tables, 1 algorithm)

Introduction
Problem: Graph Coresets for GNNs
Spectral Greedy Graph Coresets
Graph Node-wise Average Coresets
Spectral Linear Classification Coresets
Algorithm and Theoretical Analysis
Related Work
Experiments
Conclusions
Proofs and More Theoretical Discussions
Proofs for Section 3.1
Proofs for Section 3.2
Proofs for Section 4
Compressing Ego-Graph's Node Features via PCA
Message-Passing GNNs
...and 8 more sections

Key Result

Theorem 1

Under all assumptions of Proposition 5 (Smoothness of Spectral Embeddings), we have $\left\|\sum_{i\in [n_{t}]} w^{\mathtt{a}}_i \cdot \widetilde{Z}_i - \widetilde{Z}\right\|_{F} \leq M\cdot \left\|P\mathbf{w}^{\mathtt{a}}-\frac{1}{n}\mathbf{1}\right\|$ for some constant $M>0$.

Figures (6)

Figure 1: Overview of spectral greedy graph coresets (SGGC) for efficient GNN training. SGGC processes a large graph to iteratively select ego-graphs. The assembled coreset graph facilitates fast GNN training while maintaining test performance on the original graph.
Figure 2: Relative standard deviation of spectral embeddings on ego-graphs $\boldsymbol{Z}_i$ across all the nodes vs. the ego-graph size $p$; see the bounded-spectral-variance assumption.
Figure 3: Conceptual diagram showing the theoretical analysis formulating the spectral greedy graph coresets (SGGC).
Figure 4: Spectral response of 2-layer GCNs on Cora. The spectral response corresponding to eigenvalue $\lambda_i$ is defined as $\|[U^{\mkern-1.5mu\mathsf{T}} f_\theta(A,X)]_{i,:}\|/\|[U^{\mkern-1.5mu\mathsf{T}} X]_{i,:}\|$.
Figure 5: Test accuracy versus the selected data size of selecting nodes and diffusion ego-graphs with/without PCA-based compression of node attributes.
...and 1 more figures
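The spectral response defined in the Figure 4 caption can be computed along the following lines. This is a sketch under stated assumptions: $U$ is taken to be the eigenvector matrix of the symmetric normalized Laplacian (the caption does not say which operator is used), and a single propagation step with the normalized adjacency stands in for the trained 2-layer GCN $f_\theta$. The function name `spectral_response` and the toy graph are hypothetical.

```python
import numpy as np

def spectral_response(adj, features, outputs):
    """Row-wise ratio ||[U^T outputs]_i|| / ||[U^T features]_i||.

    Assumes U holds eigenvectors of the symmetric normalized Laplacian
    and that the graph has no isolated nodes.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    laplacian = np.eye(adj.shape[0]) - norm_adj
    eigvals, U = np.linalg.eigh(laplacian)
    numerator = np.linalg.norm(U.T @ outputs, axis=1)
    denominator = np.linalg.norm(U.T @ features, axis=1)
    return eigvals, numerator / denominator

# Toy 5-node path graph with random features (illustrative data only).
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Stand-in "model": one propagation step with the normalized adjacency.
deg = adj.sum(axis=1)
norm_adj = adj / np.sqrt(np.outer(deg, deg))
eigvals, response = spectral_response(adj, X, norm_adj @ X)
```

For this stand-in model the response at eigenvalue $\lambda_i$ equals $|1 - \lambda_i|$, which gives a quick sanity check before applying the same function to the outputs of a trained network.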

Theorems & Definitions (13)

Theorem 1: Upper-bound on the Error Approximating Node-wise Average
proof
Theorem 2: Error-Bound on Node Classification Loss
proof
Lemma 3: Smoothness of the Spectral Representation of Ego-graph's Input Features
proof
Lemma 4: Lipschitzness of GCN in Spectral Domain
proof
Proposition 5: Smoothness of Spectral Embeddings
Theorem 6: Upper-bound on the Error Approximating Node-wise Average
...and 3 more
