Graph Sampling for Scalable and Expressive Graph Neural Networks on Homophilic Graphs

Haolin Li, Haoyu Wang, Luana Ruiz

TL;DR

The paper addresses scaling GNNs to large graphs, where training on randomly sampled subgraphs can break connectivity and reduce expressive power. It introduces a sampling method based on feature homophily that minimizes $tr(XX^T)$, the trace of the data correlation matrix, in order to better preserve the graph Laplacian trace $tr(\mathbf{L})$. The method has $O(d|E|)$ complexity and avoids sequential node removals. Key contributions are a formal definition of feature homophily $h_G$, a provable lower bound linking $tr(\mathbf{L})$ to $h_G$, and Algorithm 1, which selects subgraphs at lower complexity than spectral methods. Experiments on citation networks show better Laplacian-trace preservation and GNN transferability than random sampling. The approach offers a practical route to scalable, expressive GNNs on large homophilic graphs and connects to leverage-score concepts and graph sparsification.

Abstract

Graph Neural Networks (GNNs) excel in many graph machine learning tasks but face challenges when scaling to large networks. GNN transferability allows training on smaller graphs and applying the model to larger ones, but existing methods often rely on random subsampling, leading to disconnected subgraphs and reduced model expressivity. We propose a novel graph sampling algorithm that leverages feature homophily to preserve graph structure. By minimizing the trace of the data correlation matrix, our method better preserves the graph Laplacian trace -- a proxy for the graph connectivity -- than random sampling, while achieving lower complexity than spectral methods. Experiments on citation networks show improved performance in preserving Laplacian trace and GNN transferability compared to random sampling.
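The objective named above, the trace of the data correlation matrix, is inexpensive to evaluate. The following is a minimal sketch and not the paper's Algorithm 1. It assumes node features are stored as an $n \times d$ matrix $X$ with one row per node and that the correlation matrix is $XX^T$, as written in the TL;DR. The function name `correlation_trace` and the toy feature matrix are illustrative assumptions. The sketch relies on the identity $tr(XX^T) = \sum_{i,j} X_{ij}^2$, so the $n \times n$ product is never formed.

```python
import numpy as np


def correlation_trace(X, nodes=None):
    """Return tr(X_S X_S^T), where X_S keeps the rows of X listed in `nodes`.

    If `nodes` is None, all rows are used. Hypothetical helper, not from the paper.
    """
    X = np.asarray(X, dtype=float)
    if nodes is not None:
        X = X[np.asarray(nodes, dtype=int)]
    # tr(X X^T) equals the sum of squared entries (squared Frobenius norm),
    # so the cost is O(|S| d) instead of building an |S| x |S| matrix.
    return float(np.sum(X * X))


# Toy feature matrix (assumed for illustration): 4 nodes, 2 features each.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 0.0],
              [1.0, 1.0]])

full = correlation_trace(X)                  # all four nodes
subset = correlation_trace(X, nodes=[0, 3])  # only nodes 0 and 3
```

Because the trace decomposes into per-node squared norms, the value for any candidate node subset can be read off from row norms computed once.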

Paper Structure

This paper contains 9 sections, 4 theorems, 12 equations, 3 figures, 1 algorithm.

Key Result

Theorem 2.1

Let $\Phi$ be a GNN with fixed coefficients, and $G_n$, $G_m$ graphs with $n$ and $m$ nodes sampled from a graphon $\mathbf{W}$. Under mild conditions, w.h.p.,

Figures (3)

  • Figure 1: Adjusted Laplacian trace versus graph subsampling rate. The adjusted trace is the subsampled graph Laplacian trace normalized by the number of sampled nodes. Boxplots indicate the median, first and third quartiles, minimum and maximum, and outliers of the trace, obtained from 50 rounds of random node subsampling; red dots are the trace of subgraphs generated using our sampling heuristic (Algorithm 1).
  • Figure 2: Example of a randomly sampled subgraph (green) and a subgraph sampled using Algorithm 1. Both graphs have 800 nodes and were sampled from the PubMed citation network, which has 19,717 nodes.
  • Figure 3: Test accuracy achieved by GNN on full graph versus training graph subsampling rate. Error bars indicate the standard deviation of the accuracy realized by GNNs trained on 50 random node-induced subgraphs; red dots are the test accuracy of GNNs trained on subgraphs produced by our heuristic (Algorithm 1).
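The adjusted Laplacian trace described in the Figure 1 caption, together with its random-sampling baseline, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: it assumes the combinatorial Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$, a dense symmetric adjacency matrix, node-induced subgraphs, and uniform sampling without replacement. The function names `adjusted_laplacian_trace` and `random_baseline` are hypothetical; the default of 50 rounds follows the caption.

```python
import numpy as np


def adjusted_laplacian_trace(A, nodes):
    """tr(L_S) / |S| for the subgraph induced by `nodes`, with L = D - A (assumed)."""
    idx = np.asarray(nodes, dtype=int)
    sub = np.asarray(A, dtype=float)[np.ix_(idx, idx)]
    # tr(D - A) = (sum of all entries of A) - (sum of diagonal entries of A).
    return float(sub.sum() - np.trace(sub)) / len(idx)


def random_baseline(A, rate, rounds=50, seed=0):
    """Adjusted traces of `rounds` uniformly sampled node-induced subgraphs."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    m = max(1, int(round(rate * n)))  # number of nodes kept at this sampling rate
    return np.array([
        adjusted_laplacian_trace(A, rng.choice(n, size=m, replace=False))
        for _ in range(rounds)
    ])


# Toy graph (assumed for illustration): a path 0 - 1 - 2 - 3.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

connected = adjusted_laplacian_trace(A, [0, 1, 2])  # keeps two edges
broken = adjusted_laplacian_trace(A, [0, 2, 3])     # keeps one edge
baseline = random_baseline(A, rate=0.5)
```

On the toy path graph, the node set that keeps two edges has a higher adjusted trace than the one that keeps a single edge, which is the sense in which the trace acts as a connectivity proxy.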

Theorems & Definitions (11)

  • Theorem 2.1: GNN transferability, simplified [ruiz2021transferability]
  • Proposition 2.2: Expressivity of Graph Convolution
  • proof
  • Definition 3.1: Feature Homophily
  • Proposition 3.2: Lower Bound on $tr(\mathbf{L})$
  • proof
  • Proposition 3.3: Complexity of Algorithm 1
  • proof
  • proof: Proof of Prop. 2.2
  • proof: Proof of Prop. 3.2
  • ...and 1 more