How Graph Neural Networks Learn: Lessons from Training Dynamics
Chenxiao Yang, Qitian Wu, David Wipf, Ruoyu Sun, Junchi Yan
TL;DR
The paper investigates how Graph Neural Networks (GNNs) learn during optimization by analyzing their training dynamics in function space through the Neural Tangent Kernel (NTK). It identifies kernel-graph alignment: the evolving NTK $\boldsymbol{\Theta}_t$ tends to align with the graph adjacency matrix $\mathbf A$, which biases learning along the graph structure and explains generalization patterns, especially on homophilic graphs. A key contribution is the Residual Propagation (RP) algorithm, a parameter-free, graph-guided update that can match or exceed GNN performance at substantially lower time and memory cost, and that is linked to classic label propagation and kernel methods. The authors provide theoretical analysis in the overparameterized regime, derive explicit node-level graph neural tangent kernel (GNTK) expressions showing how $\mathbf A$ shapes learning, and validate the findings empirically on real and synthetic data, including on heterophilic graphs, where standard GNNs may underperform. Overall, the work offers interpretable, kernel-based explanations for GNN generalization, along with a simple, scalable alternative for graph-based learning.
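To make the idea of kernel-graph alignment concrete, the following is a minimal sketch that scores how well a kernel matrix aligns with an adjacency matrix. It assumes alignment is measured by the normalized Frobenius inner product, a common kernel-alignment score; the summary above does not state the exact measure used in the paper. The function name `kernel_alignment` and the toy matrices are hypothetical, and the code uses plain Python lists so that it runs without extra dependencies.

```python
import math

def kernel_alignment(K1, K2):
    """Normalized Frobenius inner product of two equally sized matrices.

    Assumption: this cosine-style score stands in for the alignment
    measure; it is not taken from the summarized text.
    """
    inner = sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for row in K1 for a in row))
    norm2 = math.sqrt(sum(b * b for row in K2 for b in row))
    return inner / (norm1 * norm2)

# Hypothetical toy data: adjacency matrix of the path graph 0-1-2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
n = len(A)
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
# A stand-in "kernel" that mixes graph structure with self-similarity.
A_plus_I = [[A[i][j] + identity[i][j] for j in range(n)] for i in range(n)]

score_identity = kernel_alignment(identity, A)  # kernel ignoring the graph
score_mixed = kernel_alignment(A_plus_I, A)     # kernel partly following the graph
score_self = kernel_alignment(A, A)             # kernel equal to the graph
```

Under this score, a kernel that ignores the graph (the identity) gets zero alignment with $\mathbf A$, while kernels that place mass on edges score higher, up to a maximum of 1 when the kernel is proportional to $\mathbf A$.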
Abstract
A long-standing goal in deep learning has been to characterize the learning behavior of black-box models in a more interpretable manner. For graph neural networks (GNNs), considerable advances have been made in formalizing what functions they can represent, but whether GNNs will learn desired functions during the optimization process remains less clear. To fill this gap, we study their training dynamics in function space. In particular, we find that the gradient descent optimization of GNNs implicitly leverages the graph structure to update the learned function, as can be quantified by a phenomenon which we call kernel-graph alignment. We provide theoretical explanations for the emergence of this phenomenon in the overparameterized regime and empirically validate it on real-world GNNs. This finding offers new interpretable insights into when and why the learned GNN functions generalize, highlighting their limitations in heterophilic graphs. Practically, we propose a parameter-free algorithm that directly uses a sparse matrix (i.e., the graph adjacency matrix) to update the learned function. We demonstrate that this embarrassingly simple approach can be as effective as GNNs while being orders of magnitude faster.
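As an illustration of what a parameter-free, adjacency-driven update might look like, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes predictions are updated by propagating the residuals of labeled nodes through a symmetrically normalized adjacency matrix with self-loops, applied `k` times per step. The exact update rule, the normalization, and the names `residual_propagation`, `lr`, `k`, and `steps` are all hypothetical choices made for this example. The graph is stored as a sparse list of `(neighbor, weight)` pairs per node.

```python
import math

def normalized_adjacency(edges, n):
    """Sparse rows of D^{-1/2} (A + I) D^{-1/2} as lists of (neighbor, weight)."""
    nbrs = [{i} for i in range(n)]  # start with self-loops
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    deg = [len(s) for s in nbrs]
    return [
        [(j, 1.0 / math.sqrt(deg[i] * deg[j])) for j in sorted(nbrs[i])]
        for i in range(n)
    ]

def propagate(adj, x, k):
    """Apply the sparse matrix to the vector x, k times in a row."""
    for _ in range(k):
        x = [sum(w * x[j] for j, w in row) for row in adj]
    return x

def residual_propagation(edges, n, labels, train_idx, steps=50, lr=0.5, k=2):
    """Assumed update: f <- f + lr * A^k r, with r = y - f on labeled nodes, 0 elsewhere."""
    adj = normalized_adjacency(edges, n)
    train = set(train_idx)
    f = [0.0] * n  # predictions start at zero; nothing is learned except f itself
    for _ in range(steps):
        r = [labels[i] - f[i] if i in train else 0.0 for i in range(n)]
        update = propagate(adj, r, k)
        f = [fi + lr * ui for fi, ui in zip(f, update)]
    return f

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = [1.0, 0.0, 0.0, 0.0, 0.0, -1.0]  # only nodes 0 and 5 are labeled
f = residual_propagation(edges, 6, labels, train_idx=[0, 5])
predicted_sign = [1 if v > 0 else -1 for v in f]
```

In this toy run, each unlabeled node ends up with the sign of the labeled node in its own triangle, which is the behavior one would expect on a homophilic graph. The step size and number of propagation hops are hyperparameters of the sketch; no weights are trained.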
