Over-squashing in Spatiotemporal Graph Neural Networks

Ivan Marisca, Jacob Bamberger, Cesare Alippi, Michael M. Bronstein

TL;DR

This paper formalizes spatiotemporal over-squashing in STGNNs, showing that the temporal axis introduces new bottlenecks and that causal convolutions can bias information toward temporally distant inputs. It develops a theoretical framework that factorizes propagation into independent spatial and temporal components, and provides bounds that separate model parameters from topological structure. The authors introduce spatiotemporal designs (notably MPTCNs) and analyze time-and-space versus time-then-space budgets, proving that both paradigms are equally susceptible to over-squashing and that mitigating it along both dimensions is necessary. They propose temporal graph rewiring and row-normalization as practical mitigation strategies, and validate their theory with synthetic tasks and real-world forecasting benchmarks, offering principled guidance for robust and scalable STGNNs.

Abstract

Graph Neural Networks (GNNs) have achieved remarkable success across various domains. However, recent theoretical advances have identified fundamental limitations in their information propagation capabilities, such as over-squashing, where distant nodes fail to effectively exchange information. While extensively studied in static contexts, this issue remains unexplored in Spatiotemporal GNNs (STGNNs), which process sequences associated with graph nodes. Nonetheless, the temporal dimension amplifies this challenge by increasing the information that must be propagated. In this work, we formalize the spatiotemporal over-squashing problem and demonstrate its distinct characteristics compared to the static case. Our analysis reveals that, counterintuitively, convolutional STGNNs favor information propagation from points temporally distant rather than close in time. Moreover, we prove that architectures that follow either time-and-space or time-then-space processing paradigms are equally affected by this phenomenon, providing theoretical justification for computationally efficient implementations. We validate our findings on synthetic and real-world datasets, providing deeper insights into their operational dynamics and principled guidance for more effective designs.

Paper Structure

This paper contains 35 sections, 9 theorems, 47 equations, 7 figures, 5 tables.

Key Result

Theorem 4.1

Consider a TCN with $L_{\mathsf{T}}$ successive $\mathsf{TC}$ layers as in eq:tcn_ref, all with kernel size $P$, and assume that $\|{\bm{W}}_{p}^{(l)}\| \le \mathsf{w}$ for all $p<P$ and $l\leq L_{\mathsf{T}}$, and that $|\sigma^\prime| \leq c_\sigma$. For each $i, j \in [0, T

Figures (7)

  • Figure 1: Example of the spatiotemporal topology governing information propagation in STGNNs. The increasing receptive fields of graph-based and sequence-processing architectures compound, as shown in the Cartesian product of spatial and temporal graphs on the right.
  • Figure 2: Top row: paths for information flow from the most recent and an earlier time step to the last-layer representation at time $t$. Bottom row: evolution of the temporal receptive field after 4 and 20 layers, seen through the powers of the temporal topology matrix. For standard (${\mathbf{R}}$) and dilated (${\mathbf{R}}_D$) convolution, the highest-influence region shifts towards the initial time step, while for row-normalized (${\mathbf{R}}_N$) convolution, we observe a progressive shift to a uniform distribution across all time steps (first column). Entries are scaled matrix-wise in the range $[0,1]$ for comparison purposes.
  • Figure 3: Success rate (%) on the tasks of copying the first or last observed value across different temporal topologies and number of layers $L_{\mathsf{T}}$.
  • Figure 4: Success rate (%) of MPTCNs on the dataset, where the goal is to copy the average value associated with $k$-hop neighbors at time step $t-i$. The tasks vary in the type of graph used (Ring or Lollipop) and the kernel size $P$ ($2$ or $3$).
  • Figure 5: Sensitivity of the last-layer representation at the last time step $t$ to earlier time steps in a TCN with $L$ layers and kernel size $P=3$. The values correspond to entries $\left({\mathbf{R}}^L\right)_{i0}$ for the standard convolution (a) and $\left({\mathbf{R}}_N^L\right)_{i0}$ for the normalized convolution (b), with $i \ge 0$ being the backward distance from $t$. As depth increases, the standard convolution favors information from earlier steps, while the normalized version asymptotically approaches uniform sensitivity across all steps.
  • ...and 2 more figures
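
The matrix powers described in the captions of Figures 2 and 5 can be explored numerically. The sketch below is only an illustration under stated assumptions, not the authors' code: it assumes the temporal topology matrix ${\mathbf{R}}$ of a causal convolution with kernel size $P$ is the banded lower-triangular matrix with ones where $0 \le i - j < P$ (rows index output steps, columns index input steps, time increasing with the index), and that ${\mathbf{R}}_N$ is obtained by dividing each row of ${\mathbf{R}}$ by its sum. The function names `temporal_topology`, `row_normalize`, and `first_column_of_power` are hypothetical helpers introduced here, and the dilated variant ${\mathbf{R}}_D$ is not covered.

```python
import numpy as np


def temporal_topology(T: int, P: int) -> np.ndarray:
    """Assumed form of R: R[i, j] = 1 when 0 <= i - j < P, else 0."""
    offsets = np.arange(T)[:, None] - np.arange(T)[None, :]
    return ((offsets >= 0) & (offsets < P)).astype(float)


def row_normalize(R: np.ndarray) -> np.ndarray:
    """Assumed form of R_N: each row of R divided by its row sum."""
    return R / R.sum(axis=1, keepdims=True)


def first_column_of_power(R: np.ndarray, L: int) -> np.ndarray:
    """Column 0 of R^L, rescaled so that its largest entry equals 1."""
    col = np.linalg.matrix_power(R, L)[:, 0]
    return col / col.max()


if __name__ == "__main__":
    T, P = 8, 3  # illustrative sizes, not taken from the paper
    R = temporal_topology(T, P)
    R_N = row_normalize(R)
    for L in (4, 20):
        print(f"L={L:2d} standard  :", np.round(first_column_of_power(R, L), 3))
        print(f"L={L:2d} normalized:", np.round(first_column_of_power(R_N, L), 3))
```

Under these assumptions, entry $i$ of the first column measures how strongly step $i$ depends on the input at step $0$, that is, on an input $i$ steps back. With $T=8$ and $P=3$, the standard column peaks at $i=4$ after 4 layers and increases monotonically with $i$ after 20 layers, while the row-normalized column flattens toward a constant as depth grows, in line with the behavior the captions describe.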

Theorems & Definitions (15)

  • Theorem 4.1
  • Proposition 4.1
  • Theorem 5.1
  • Lemma A.1: Single $\mathsf{TC}$ layer
  • proof
  • Theorem A.1
  • proof
  • Proposition A.1
  • proof
  • Proposition A.2
  • ...and 5 more