Spectral Greedy Coresets for Graph Neural Networks
Mucong Ding, Yinhan He, Jundong Li, Furong Huang
TL;DR
This paper tackles training efficiency for Graph Neural Networks on large-scale graphs by introducing Spectral Greedy Graph Coresets (SGGC), which select ego-graphs around center nodes in the graph spectral domain to approximate full-graph training loss. SGGC uses a two-stage greedy framework: a coarse spreading step via GIGA to approximate the spectral embedding and a refinement step via submodular maximization (CRAIG) to diversify topology; diffusion ego-graphs and PCA compression enable scalable coreset construction without pre-training. The authors provide a theoretical bound on the node-classification loss under a bounded-spectral-variance assumption and demonstrate strong empirical performance on ten datasets, including very large graphs, often outperforming model-based coresets and graph condensation while running faster and generalizing across GNN architectures. This work enables scalable, architecture-agnostic data condensation for GNNs, reducing training time and memory with minimal loss in accuracy, though it assumes smooth spectral embeddings and has $O(c n_t n)$ time complexity in the coreset construction.
Abstract
The ubiquity of large-scale graphs in node-classification tasks significantly hinders real-world applications of Graph Neural Networks (GNNs). Node sampling, graph coarsening, and dataset condensation are effective strategies for enhancing data efficiency. However, because graph nodes are interdependent, coreset selection, which selects a subset of the data examples, has not been successfully applied to speed up GNN training on large graphs and warrants special treatment. This paper studies graph coresets for GNNs and avoids the interdependence issue by selecting ego-graphs (i.e., neighborhood subgraphs around a node) based on their spectral embeddings. We decompose the coreset selection problem for GNNs into two phases: a coarse selection of widely spread ego-graphs and a refined selection to diversify their topologies. We design a greedy algorithm that approximately optimizes both objectives. Our spectral greedy graph coreset (SGGC) scales to graphs with millions of nodes, obviates the need for model pre-training, and applies to low-homophily graphs. Extensive experiments on ten datasets demonstrate that SGGC outperforms other coreset methods by a wide margin, generalizes well across GNN architectures, and is much faster than graph condensation.
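To make the idea of greedily selecting diverse ego-graph centers more concrete, the following is a minimal sketch in Python. It is not the authors' SGGC implementation. It assumes that each node's ego-graph can be summarized by a few steps of row-normalized feature propagation (a simple stand-in for the diffusion ego-graph embedding), and it shows only a facility-location style greedy selection in the spirit of the submodular refinement stage. The GIGA-based coarse stage and the PCA compression are omitted. The function names `diffusion_embeddings` and `greedy_facility_location` and the parameters `hops` and `budget` are hypothetical.

```python
import numpy as np


def diffusion_embeddings(adj, features, hops=2):
    """Smooth node features over `hops` propagation steps.

    Assumption: this stands in for an ego-graph embedding; the paper's
    spectral embedding is not reproduced here.
    """
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                       # add self-loops
    p = a_hat / a_hat.sum(axis=1, keepdims=True)  # row-normalize
    z = features.astype(float)
    for _ in range(hops):
        z = p @ z
    return z


def greedy_facility_location(z, budget):
    """Greedily pick `budget` centers maximizing sum_i max_{j in S} sim(i, j)."""
    n = z.shape[0]
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    sim = dists.max() - dists        # nonnegative similarity
    best = np.zeros(n)               # current coverage of each node
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Marginal gain of adding each candidate center j.
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[chosen] = -np.inf      # never pick the same center twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected


if __name__ == "__main__":
    # Toy graph: two disconnected triangles with different features.
    adj = np.zeros((6, 6))
    for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
        adj[a, b] = adj[b, a] = 1.0
    features = np.array([[1, 0]] * 3 + [[0, 1]] * 3)
    centers = greedy_facility_location(diffusion_embeddings(adj, features), 2)
    print(centers)
```

On the toy graph, a budget of two centers is enough to cover both triangles, which illustrates why a diversity-seeking objective is preferable to picking the highest-scoring nodes independently. The dense pairwise distance matrix used here is quadratic in the number of nodes, so this sketch is only suitable for small graphs.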
