How does over-squashing affect the power of GNNs?

Francesco Di Giovanni; T. Konstantin Rusch; Michael M. Bronstein; Andreea Deac; Marc Lackenby; Siddhartha Mishra; Petar Veličković

TL;DR

This work analyzes how over-squashing limits the expressive power of MPNNs by introducing a Hessian-based pairwise mixing measure that quantifies how well node features can interact under message passing. The authors derive a general bound on mixing that depends on network capacity, via depth $m$ and weight norm $\mathsf{w}$, and on graph topology through the operator $S$ and its higher-order corrections, highlighting the role of commute times. They define over-squashing as the inverse of maximal mixing and introduce a computable proxy $\widetilde{\mathsf{OSQ}}$ to obtain necessary conditions on capacity for learning functions with prescribed mixing; they prove that, in bounded-depth or bounded-weight regimes, achieving high mixing becomes impractical on graphs with large commute times. Experimental validation on synthetic ZINC graphs shows that increasing commute time degrades performance and increases OSQ, while deeper architectures can mitigate these effects, illustrating practical implications and guiding remedies such as graph rewiring or more expressive architectures like Graph Transformers. Overall, the paper provides a rigorous framework linking over-squashing, graph topology, and GNN expressive power, with concrete bounds and empirical confirmation that inform the design of scalable GNNs for long-range relational tasks.
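The Hessian-based notion of mixing can be illustrated numerically. The sketch below is not the paper's model or code: it assumes a hypothetical one-layer toy network (`toy_mpnn`) with scalar node features on a 4-node path graph, and it estimates the mixed second derivative of the graph-level output at a single input rather than taking the maximum over inputs. All names (`A_HAT`, `toy_mpnn`, `mixed_partial`) are illustrative assumptions.

```python
import numpy as np

# Assumed toy setting: path graph 0-1-2-3, adjacency with self-loops,
# one scalar feature per node.
A_HAT = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)


def toy_mpnn(x):
    """One sum-aggregation layer with tanh, followed by a sum readout."""
    return np.tanh(A_HAT @ x).sum()


def mixed_partial(f, x, i, j, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j), for i != j."""
    e_i = np.zeros_like(x)
    e_i[i] = h
    e_j = np.zeros_like(x)
    e_j[j] = h
    return (
        f(x + e_i + e_j) - f(x + e_i - e_j)
        - f(x - e_i + e_j) + f(x - e_i - e_j)
    ) / (4.0 * h * h)


x0 = np.array([0.3, -0.2, 0.5, 0.1])

# Nodes 0 and 2 share the neighbour 1, so one layer can mix their features.
mix_02 = mixed_partial(toy_mpnn, x0, 0, 2)

# Nodes 0 and 3 have no common closed neighbourhood, so one layer cannot
# mix them: the mixed derivative vanishes up to rounding error.
mix_03 = mixed_partial(toy_mpnn, x0, 0, 3)
```

In this toy setting the pair (0, 3) receives no mixing at depth one, whichever weights are used. This is only the simplest, purely distance-based obstruction; the bounds summarized above are finer and also involve the weight norm and commute times.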

Abstract

Graph Neural Networks (GNNs) are the state-of-the-art model for machine learning on graph-structured data. The most popular class of GNNs operates by exchanging information between adjacent nodes, and is known as Message Passing Neural Networks (MPNNs). Given their widespread use, understanding the expressive power of MPNNs is a key question. However, existing results typically consider settings with uninformative node features. In this paper, we provide a rigorous analysis to determine which function classes of node features can be learned by an MPNN of a given capacity. We do so by measuring the level of pairwise interactions between nodes that MPNNs allow for. This measure provides a novel quantitative characterization of the so-called over-squashing effect, which is observed to occur when a large volume of messages is aggregated into fixed-size vectors. Using our measure, we prove that, to guarantee sufficient communication between pairs of nodes, the capacity of the MPNN must be large enough, depending on properties of the input graph structure, such as commute times. For many relevant scenarios, our analysis results in impossibility statements in practice, showing that over-squashing hinders the expressive power of MPNNs. We validate our theoretical findings through extensive controlled experiments and ablation studies.
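Since commute times are the graph property the analysis depends on, a minimal sketch of how they can be computed may help. It uses the standard identity $\tau(u,v) = 2|\mathsf{E}|\,R(u,v)$, where $R$ is the effective resistance obtained from the pseudoinverse of the graph Laplacian. The function name `commute_times` and the example graph are assumptions for illustration and are not taken from the paper's code; the graph is assumed to be connected, undirected, and unweighted.

```python
import numpy as np


def commute_times(adj):
    """Pairwise commute times of a connected, undirected, unweighted graph.

    Uses tau(u, v) = 2|E| * R(u, v), with effective resistance
    R(u, v) = L+[u, u] + L+[v, v] - 2 L+[u, v] and L+ the Laplacian pseudoinverse.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    lap_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(lap_pinv)
    resistance = diag[:, None] + diag[None, :] - 2.0 * lap_pinv
    num_edges = adj.sum() / 2.0
    return 2.0 * num_edges * resistance


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to
# node 2, so that edge (2, 3) is a cut-edge and |E| = 4.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
tau = commute_times(adj)
```

On this example, `tau[2, 3]` equals $2|\mathsf{E}| = 8$ for the adjacent pair on the cut-edge, matching the remark in the Figure 2 caption, while the adjacent pair inside the triangle has the smaller value `tau[0, 1]` $= 16/3$. Two pairs at the same graph distance can therefore have different commute times.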

Paper Structure (28 sections, 16 theorems, 113 equations, 8 figures, 1 table)

Introduction
The Message-Passing paradigm
MPNNs on geometric graphs.
On the mixing induced by Message Passing Neural Networks
Pairwise mixing induced by $\text{MPNN}$s
Over-squashing limits the expressive power of $\text{MPNN}$s
The case of fixed depth $m$ and variable weights norm $\mathsf{w}$
The case of fixed weights norm $\mathsf{w}$ and variable depth $m$
Experimental validation of the theoretical results
The role of commute time
The role of depth
The role of mixing
Discussion
Limitations and ways forward
Outline of the appendix
...and 13 more sections

Key Result

Theorem 3.2

Consider an MPNN of depth $m$ as in eq:mpnn_message_functions, where $\sigma$ and $\psi^{(t)}$ are $\mathcal{C}^2$ functions and we denote the bounds on their derivatives and on the norm of the weights as above. Let $\boldsymbol{\mathsf{S}}$ and $\boldsymbol{\mathsf{Q}}_k$ be defined as in eq:def_g

Figures (8)

Figure 1: We study the power of $\text{MPNN}$s in terms of the mixing they induce among features and show that this is affected by the model (via norm of the weights and depth) and the graph topology (via commute times). For the given graph, the $\text{MPNN}$ learns stronger mixing (tight springs) for nodes $v,u$ and $u,w$ since their commute time is small, while nodes $u,q$ and $u,z$, with high commute-time, have weak mixing (loose springs). We characterize over-squashing as the inverse of the mixing induced by an $\text{MPNN}$ and hence relate it to its power. In fact, the $\text{MPNN}$ might require an impractical depth to solve tasks on the given graph that depend on high-mixing of features assigned to $u,z$.
Figure 2: (Left) Exemplary molecular graph of the ZINC (12K) dataset with colored nodes corresponding to different values of commute time $\tau$. We note that $\tau$ is a more refined measure than the distance, and in fact beyond long-range nodes (red case), $\tau$ also captures other topological properties (yellow nodes are adjacent but belong to a cut-edge, so their commute-time is $2|\mathsf{E}|$). (Right) Histogram of commute time $\tau$ between all pairs of the graph nodes.
Figure 3: Test MAE (average and standard deviation over several random weight initializations) of GCN, GIN, GraphSAGE, and GatedGCN on synthetic ZINC, where the commute time of the underlying mixing is varied, while the $\text{MPNN}$ architecture is fixed (e.g., depth, number of parameters), i.e., mixing according to increasing values of the $\alpha$-quantile of the $\tau$-distribution over the ZINC graphs.
Figure 4: Test MAE (average and standard deviation over several random weight initializations) of GCN, GIN, GraphSAGE, and GatedGCN on synthetic ZINC, where the commute time is fixed to be high (i.e., at the level of the $0.8$-quantile), while only the depth of the underlying $\text{MPNN}$ is varied between $4$ and $32$ (all other architectural components are fixed).
Figure 5: Train MAE of GCN, GIN, GraphSAGE, and GatedGCN on synthetic ZINC, where the commute time of the underlying mixing is varied, while the $\text{MPNN}$ architecture is fixed (e.g., depth, number of parameters), i.e., mixing according to increasing values of the $\alpha$-quantile of the $\tau$-distribution over the ZINC graphs.
...and 3 more figures

Theorems & Definitions (32)

Definition 3.1
Theorem 3.2
Definition 3.3
Definition 4.1
Definition 4.2
Theorem 4.3
Theorem 4.4
Corollary 4.5
Theorem C.1
proof
...and 22 more
