How Graph Neural Networks Learn: Lessons from Training Dynamics

Chenxiao Yang, Qitian Wu, David Wipf, Ruoyu Sun, Junchi Yan

TL;DR

The paper investigates how Graph Neural Networks learn during optimization by analyzing training dynamics in function space through the Neural Tangent Kernel (NTK). It reveals kernel-graph alignment, where the evolving NTK $\boldsymbol{\Theta}_t$ tends to align with the graph adjacency $\mathbf A$, biasing learning along graph structure and explaining generalization patterns, especially on homophilic graphs. A key contribution is the Residual Propagation (RP) algorithm, a parameter-free, graph-guided update that can match or exceed GNN performance with substantial speed/memory advantages, and is linked to classic label propagation and kernel methods. The authors provide theoretical analysis in the overparameterized regime, derive explicit node-level GNTK expressions showing how $\mathbf A$ shapes learning, and validate findings empirically on real and synthetic data, including insights into heterophily where standard GNNs may underperform. Overall, the work offers interpretable, kernel-based explanations for GNN generalization, along with a simple, scalable alternative for graph-based learning.

Abstract

A long-standing goal in deep learning has been to characterize the learning behavior of black-box models in a more interpretable manner. For graph neural networks (GNNs), considerable advances have been made in formalizing what functions they can represent, but whether GNNs will learn desired functions during the optimization process remains less clear. To fill this gap, we study their training dynamics in function space. In particular, we find that the gradient descent optimization of GNNs implicitly leverages the graph structure to update the learned function, as can be quantified by a phenomenon which we call kernel-graph alignment. We provide theoretical explanations for the emergence of this phenomenon in the overparameterized regime and empirically validate it on real-world GNNs. This finding offers new interpretable insights into when and why the learned GNN functions generalize, highlighting their limitations in heterophilic graphs. Practically, we propose a parameter-free algorithm that directly uses a sparse matrix (i.e., the graph adjacency) to update the learned function. We demonstrate that this embarrassingly simple approach can be as effective as GNNs while being orders of magnitude faster.

Paper Structure (59 sections, 7 theorems, 97 equations, 4 figures, 5 tables, 2 algorithms)

Key Result

Proposition 3.1

The first step of RP in (eqn_rp) yields the same classification results as LP in (eqn_lp) (with $\alpha = 1$ and $k = K$), and each subsequent step of RP can also be viewed as LP on adjusted ground-truth labels, i.e. the residuals $\mathbf R_t = \mathbf Y - \mathbf F_t$.
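
The relationship stated in Proposition 3.1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the update $\mathbf F_{t+1} = \mathbf F_t + \eta \hat{\mathbf A}^K \mathbf R_t$ with residuals masked to labelled nodes, the symmetric normalization of $\mathbf A$, zero initialization of $\mathbf F_0$, and all function names (`sym_normalize`, `propagate`, `label_propagation`, `residual_propagation`) are assumptions made for illustration. Under these assumptions the first RP step is $\eta$ times the LP output, so the predicted classes coincide.

```python
import numpy as np

def sym_normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (an assumed choice of propagation matrix)."""
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def propagate(A_hat, X, K):
    """Multiply X by A_hat K times, i.e. compute A_hat^K X."""
    for _ in range(K):
        X = A_hat @ X
    return X

def label_propagation(A_hat, Y, train_mask, K):
    """LP with alpha = 1: propagate the observed labels K hops."""
    return propagate(A_hat, Y * train_mask[:, None], K)

def residual_propagation(A_hat, Y, train_mask, K, eta, steps):
    """Hypothetical RP sketch: repeatedly propagate residuals of labelled nodes."""
    F = np.zeros_like(Y, dtype=float)          # assumed initialization F_0 = 0
    for _ in range(steps):
        R = (Y - F) * train_mask[:, None]      # residuals R_t = Y - F_t, labelled nodes only
        F = F + eta * propagate(A_hat, R, K)   # graph-guided, parameter-free update
    return F

# Toy path graph 0-1-2-3-4-5 with self-loops; only the two end nodes are labelled.
n = 6
A = np.eye(n)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_hat = sym_normalize(A)

Y = np.zeros((n, 2))
Y[:3, 0] = 1.0   # nodes 0-2 belong to class 0
Y[3:, 1] = 1.0   # nodes 3-5 belong to class 1
train_mask = np.array([1, 0, 0, 0, 0, 1], dtype=float)

lp_pred = label_propagation(A_hat, Y, train_mask, K=3).argmax(axis=1)
rp_pred = residual_propagation(A_hat, Y, train_mask, K=3, eta=0.5, steps=1).argmax(axis=1)
```

Running further steps (`steps > 1`) reuses the same propagation with the current residuals in place of the labels, which is the sense in which each later step can be read as LP on adjusted labels.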

Figures (4)

  • Figure 1: (a) Training dynamics of GNNs in function space where residuals (i.e. difference between labels and predictions) propagate from observed to unobserved samples based on a kernel similarity measure. (b) The kernel matrix $\mathbf \Theta_t$ naturally aligns with the adjacency matrix $\mathbf A$, which is favorable for generalization if $\mathbf A$ is inherently close to the optimal kernel $\mathbf \Theta^*$.
  • Figure 2: Learning curves of RP and comparison with the performance of LP ($\alpha=1$), linear GNN and deep GNN. Transition from yellow to purple denotes RP with decreasing step size $\eta$.
  • Figure 3: Evolution of the NTK matrix $\mathbf \Theta_t$ of a GCN during training, reflected by matrix alignment. (a) Synthetic dataset generated by a stochastic block model, where the homophily level gradually decreases by altering edge probabilities, i.e. homophilic$\rightarrow$heterophilic; (b & c) Real-world homophilic (Cora) and heterophilic (Texas) datasets, where the graph is gradually coarsened until there is no edge left when evaluating $\mathbf \Theta_t$, i.e. more graph$\rightarrow$less graph. (Details in Appendix \ref{app_detail2}.)
  • Figure 4: Visualization of the NTK of a well-trained GCN on a node classification benchmark (Cora). $50$ nodes are randomly selected for clarity. From left to right are the $c\times c$, $n\times n$, and $nc\times nc$ NTK matrices, where the former two are obtained by averaging the $nc\times nc$ NTK matrix over dimensions $n$ and $c$ respectively. The diagonal patterns in the first and last matrices verify that our analysis for finitely-wide GNNs in binary classification also applies to the multi-class classification setting.

Theorems & Definitions (13)

  • Proposition 3.1: Connection with Label Propagation
  • Theorem 3.2: Convergence & Connection with Kernel Regression
  • Definition 4.1: Node-Level GNTK
  • Theorem 4.2: Two-Layer GNN
  • Theorem 4.3: Deep and Wide GNN Dynamics
  • Corollary 4.4: One-Layer GNN
  • Remark
  • Definition 5.1: Alignment (Cristianini et al., 2001)
  • Theorem 5.2: Bayesian Optimality of GNN
  • Remark
  • ...and 3 more
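
The alignment measure behind Definition 5.1 and Figure 3 can be sketched as follows, assuming it is the standard kernel alignment of Cristianini et al., i.e. the normalized Frobenius inner product $\langle \mathbf K_1, \mathbf K_2 \rangle_F / (\|\mathbf K_1\|_F \|\mathbf K_2\|_F)$. The function name `alignment`, the toy graphs, and the use of the label-agreement matrix $\mathbf Y \mathbf Y^\top$ as a stand-in for the optimal kernel $\mathbf \Theta^*$ are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment: Frobenius inner product normalized by Frobenius norms."""
    inner = np.sum(K1 * K2)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return inner / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Four nodes with labels [0, 0, 1, 1], one-hot encoded.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
target = Y @ Y.T  # assumed stand-in for the optimal kernel: 1 where labels agree

# Homophilic toy graph: edges 0-1 and 2-3 (same class), plus self-loops.
A_homo = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]], dtype=float)

# Heterophilic toy graph: edges 0-2 and 1-3 (different classes), plus self-loops.
A_hetero = np.array([[1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1]], dtype=float)

align_homo = alignment(A_homo, target)      # adjacency equals the target here
align_hetero = alignment(A_hetero, target)  # only the self-loops agree with the target
```

In this toy case the homophilic adjacency is perfectly aligned with the label-agreement matrix while the heterophilic one is not, which mirrors the intuition in Figure 1(b) that alignment with $\mathbf A$ helps only when $\mathbf A$ is itself close to the optimal kernel.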