Which Algorithms Can Graph Neural Networks Learn?

Solveig Wittig, Antonis Vasileiou, Robert R. Nerem, Timo Stoll, Floris Geerts, Yusu Wang, Christopher Morris

TL;DR

This work provides a principled theory for when graph neural networks can learn and generalize discrete graph algorithms. By linking algorithmic invariants to Lipschitz properties, covering numbers, and differentiable regularization, the authors characterize which algorithms can be learned from finite data and extrapolated to arbitrarily large graphs, including single-source shortest paths (SSSP), minimum spanning trees (MST), and dynamic-programming-style problems such as the $0$-$1$ knapsack problem. They show both positive results for learnable classes (via normalized sum, mean, and min/max aggregations) and impossibility results for standard MPNNs on certain tasks, and they propose more expressive architectures that overcome these limits. A key advance is a differentiable regularization approach that tightens the Bellman-Ford extrapolation analysis and yields explicit, small training sets with provable size generalization guarantees. The empirical study largely supports the theory, showing gains in size generalization and a benefit from the proposed regularization when learning graph algorithms from limited data.

Abstract

In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the $0$-$1$ knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
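
The connection between Bellman-Ford and min-aggregation message passing mentioned above can be illustrated with a minimal sketch. This is not the authors' architecture: the function names `bellman_ford_step` and `k_fold_bellman_ford`, the dictionary-based graph encoding, and the toy graph are assumptions made purely for illustration. Each step updates every vertex to the minimum of its current value and the values of its in-neighbors plus the corresponding edge weights, which is the pattern a min-aggregation MPNN layer would have to reproduce.

```python
import math

def bellman_ford_step(dist, edges):
    """One synchronous relaxation step.

    dist:  dict vertex -> current distance estimate
    edges: dict (u, v) -> weight of the directed edge u -> v
    Each vertex keeps the min of its own value and dist[u] + w over
    all incoming edges (u, v), i.e. a min-aggregation over messages.
    """
    new_dist = dict(dist)
    for (u, v), w in edges.items():
        # Messages are computed from the OLD estimates (synchronous update).
        new_dist[v] = min(new_dist[v], dist[u] + w)
    return new_dist

def k_fold_bellman_ford(dist, edges, k):
    """Apply the relaxation step k times."""
    for _ in range(k):
        dist = bellman_ford_step(dist, edges)
    return dist

# Hypothetical toy graph with source "s".
edges = {("s", "a"): 2.0, ("s", "b"): 5.0, ("a", "b"): 1.0, ("b", "c"): 1.0}
init = {"s": 0.0, "a": math.inf, "b": math.inf, "c": math.inf}

one_step = bellman_ford_step(init, edges)
result = k_fold_bellman_ford(init, edges, 3)
```

After three steps the estimates in this toy graph equal the true shortest-path distances from `s`, because no shortest path here uses more than three edges.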

Paper Structure (133 sections, 68 theorems, 332 equations, 8 figures, 8 tables)

Key Result

Proposition 1

Let $\mathcal{G}$ be the set of attributed graphs with $n$ vertices and vertex/edge attributes taking values in a compact set of $\mathbb{R}^d$. Let $\textsf{alg}\in\{1\textrm{-}\textsf{WL}, 1\textrm{-}\textsf{iWL}, (1,1)\textrm{-}\textsf{WL}\}$, and let $g\colon V_k(\mathcal{G})\to \mathbb{R}$

Figures (8)

  • Figure 1: An illustration of the learnability result in \ref{thm:specific_regularization}, applied to MPNNs by first mapping the space $V_1(\mathcal{G})$ to a pseudometric space (via IDMs or computation trees; see \ref{subsec:idm_pseudometrics}), satisfying \ref{def:finite_uniform_approx} and then applying \ref{thm:specific_regularization}. A computation-tree construction is shown on the right.
  • Figure 2: Training error and test score (lower is better) for size generalization experiments in Q1 using test datasets with 64 and 1024 vertices, respectively. Values were smoothed using Gaussian smoothing with $\sigma =1$. The gray region indicates the loss values for which \ref{thm:bf_informal} guarantees extrapolation.
  • Figure 3: Bellman--Ford training graphs. Dots indicate omitted intermediate vertices, and edge weights are shown on the edges. (a) General path graph associated with $\mathbold{w}\in\mathbb{R}^{K+1}$ as in \ref{def:path_graph}. The initial vertex is labeled $a(v_0)=w_0$, while all other vertices have label $\beta\gg 0$. The path has length $K$ with edge weights $w_1,\dots,w_K$. (b) Bellman--Ford training set for arbitrary $K$ as defined in \ref{def:BF-training_set}, consisting of $K+1$ path graphs corresponding to the scaled unit vectors $x\mathbold{e}_0^{K+1},\dots,x\mathbold{e}_K^{K+1}$. Each path contains $K+1$ vertices $v_0,\dots,v_K$. The root vertex satisfies $a(v_0)=x$ if $k=0$ and $a(v_0)=0$ otherwise, while all other vertices have label $\beta$. Exactly one edge per path has weight $x$, and all remaining edges have weight $0$.
  • Figure 4: Edge case graph from \ref{def:edge_case_graph} for learning the $K$-fold Bellman--Ford update with higher aggregation dimension. The root vertex is $v_{-1}$. Shown here is the instance for $K=4$. At each vertex $v_i$, $i\in[K]$, the network can choose between multiple paths of equal total weight $x$.
  • Figure 5: A compact space of computation trees enables algorithmic generalization. Under an appropriate metric on $\mathcal{T}$, the computation trees $T_{\mathrm{small}}$ and $T_{\mathrm{large}}$ are close, so regularization-induced Lipschitz continuity of the model implies that good performance on the training instance $T_{\mathrm{small}}$ transfers to good performance on the nearby instance $T_{\mathrm{large}}$. Occurrences of $v$ in the computation tree other than the root are omitted for simplicity.
  • ...and 3 more figures
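
The path-graph training set described in the Figure 3 caption can be sketched as follows. This is a reading of the caption only, not the paper's formal definitions: the helper names `path_graph`, `bf_training_set`, and `k_fold_update`, the list-based encoding, the choice to direct edges away from $v_0$, and the concrete values of $K$, $x$, and $\beta$ are all assumptions for illustration.

```python
def path_graph(w, beta):
    """Path graph for a weight vector w of length K + 1.

    Vertex v_0 carries label w[0]; every other vertex carries label beta.
    Edge (v_{i-1}, v_i) has weight w[i] for i = 1..K.
    Edges are directed away from v_0 (an assumption of this sketch).
    """
    K = len(w) - 1
    labels = [w[0]] + [beta] * K
    edges = [(i - 1, i, w[i]) for i in range(1, K + 1)]
    return labels, edges

def bf_training_set(K, x, beta):
    """K + 1 path graphs, one per scaled unit vector x * e_k, k = 0..K."""
    graphs = []
    for k in range(K + 1):
        w = [0.0] * (K + 1)
        w[k] = x  # exactly one entry equals x, all others are 0
        graphs.append(path_graph(w, beta))
    return graphs

def k_fold_update(labels, edges, K):
    """K synchronous Bellman-Ford relaxations starting from the labels."""
    dist = list(labels)
    for _ in range(K):
        new = list(dist)
        for u, v, wt in edges:
            new[v] = min(new[v], dist[u] + wt)
        dist = new
    return dist

# Hypothetical parameter values.
K, x, beta = 3, 1.0, 100.0
graphs = bf_training_set(K, x, beta)
last_values = [k_fold_update(labels, edges, K)[-1] for labels, edges in graphs]
```

In this sketch, with $x < \beta$, the value at the last vertex $v_K$ after $K$ updates is $x$ for every graph in the set, because the root label plus the edge weights along each path always sum to $x$.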

Theorems & Definitions (151)

  • Proposition 1
  • Definition 2
  • Theorem 3: Informal
  • Theorem 4: Informal
  • Theorem 5: Informal
  • Proposition 6: Informal
  • Lemma 7: Informal
  • Lemma 8
  • proof
  • Theorem 9: \ref{thm:specific_regularization} in the main text
  • ...and 141 more