Pruning at Initialisation through the lens of Graphon Limit: Convergence, Expressivity, and Generalisation

Hoang Pham, The-Anh Ta, Long Tran-Thanh

TL;DR

The paper studies how pruning-at-initialisation (PaI) shapes large sparse neural networks by formulating a graphon-limit theory for PaI masks under a Factorised Saliency Model. It proves that, as width grows, PaI masks converge in probability to deterministic bipartite graphons, enabling a topological taxonomy that separates unstructured pruning (constant graphons) from data-driven pruning (heterogeneous graphons). It then establishes a Universal Approximation Theorem on active coordinate subspaces and a Graphon-NTK continuation that yields a generalisation bound driven by path density through the limit graphon. Empirically, the theory is validated by the visual convergence of finite-width masks to the predicted graphons and by showing that data-dependent pruning concentrates connectivity on informative coordinates, improving kernel alignment and generalisation under sparsity. These results recast sparse network analysis in terms of continuous operators, providing principled guidance for designing efficient PaI strategies with predictable expressivity and generalisation.

Abstract

Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finite-width statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. We introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.

Paper Structure (90 sections, 16 theorems, 144 equations, 11 figures, 1 table)


Key Result

Theorem 4.7

Let $d\to\infty$ and $n\to\infty$. Consider the factorised saliency model with indices $i\in[d]$ and $j\in[n]$, and let $W_n:=W_{M_n}$ be the bipartite step-kernel on $[0,1]^2$ associated with the mask $M_n$. Under Assumptions (growth) through (threshold), let $\mathcal{W}$ denote the deterministic limit graphon. Then $\delta^{\mathrm{bip}}_\square(W_n,\mathcal{W})\xrightarrow{\mathbb{P}}0$, that is, $W_n$ converges to $\mathcal{W}$ in probability in the bipartite cut distance.
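To make the statement concrete, the following is a minimal sketch of how a mask's coarse block densities can be compared across pruning rules. The saliency form used here, $s_{ij} = a(i/d)\,|w_{ij}|$ with Gaussian weights and a global top-$\rho$ threshold, is a hypothetical stand-in chosen for illustration; it is not the paper's Factorised Saliency Model, and the helper names `topk_mask`, `sample_mask`, and `block_density` are assumptions. Block averaging is only a rough proxy for the step-kernel, not a computation of the cut distance.

```python
import numpy as np

def topk_mask(saliency, rho):
    """Binary mask keeping the rho-fraction of entries with the largest saliency."""
    k = int(round(rho * saliency.size))
    thresh = np.partition(saliency.ravel(), -k)[-k]  # k-th largest saliency value
    return (saliency >= thresh).astype(float)

def sample_mask(d, n, rho, rng, row_scale=None):
    """Sample a d x n mask from the illustrative saliency a(i/d) * |w_ij|.

    row_scale=None gives a constant a, i.e. plain magnitude pruning.
    A non-constant row_scale mimics a data-dependent, per-input-coordinate factor
    (an assumption of this sketch, not the paper's definition).
    """
    w = rng.standard_normal((d, n))
    u = (np.arange(d) + 0.5) / d              # input-coordinate positions in (0, 1)
    a = np.ones(d) if row_scale is None else row_scale(u)
    return topk_mask(a[:, None] * np.abs(w), rho)

def block_density(mask, bins):
    """Average the mask over a bins x bins grid: a coarse view of its step-kernel."""
    rows = np.array_split(np.arange(mask.shape[0]), bins)
    cols = np.array_split(np.arange(mask.shape[1]), bins)
    return np.array([[mask[np.ix_(r, c)].mean() for c in cols] for r in rows])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for width in (100, 400):
        flat = block_density(sample_mask(width, width, 0.2, rng), 4)
        skew = block_density(
            sample_mask(width, width, 0.2, rng, row_scale=lambda u: 0.5 + u), 4
        )
        print(width, "constant factor, max deviation from rho:", np.abs(flat - 0.2).max())
        print(width, "varying factor, mean density per row block:", skew.mean(axis=1))
```

With a constant factor the block densities should all sit near $\rho$, in line with the homogeneous (constant) graphon described for unstructured methods; with a varying factor the density shifts towards the rows with larger scale, which is the kind of heterogeneous profile the taxonomy attributes to data-driven methods.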

Figures (11)

  • Figure 1: Visual convergence to the graphon limit. We compare averaged empirical masks (over 100 seeds) at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.2$.
  • Figure 2: Sensitivity of Graphon-NTK Complexity to Label Noise and Sparsity. Each panel plots the theoretical complexity $y^\top K_{\mathcal{W}}^{-1}y$ (y-axis) against the ratio of randomised labels (x-axis) for a specific density $\rho$.
  • Figure 3: Visual Convergence to the Graphon Limit. We compare empirical masks with different activation functions at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.1$.
  • Figure 4: Visual Convergence to the Graphon Limit. We compare empirical masks with different activation functions at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.2$.
  • Figure 5: Graphon Convergence of GraSP Variants. Empirical pruning masks for GraSP at increasing widths ($n \in \{200, \dots, 4000\}$) with density $\rho=20\%$. We compare Magnitude GraSP (rows 1, 3) and Signed GraSP (rows 2, 4) for Tanh and Sigmoid activations.
  • ...and 6 more figures
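
The complexity $y^\top K^{-1} y$ plotted in Figure 2 can be computed for any positive-definite kernel matrix. The sketch below is illustrative only: the Graphon-NTK $K_{\mathcal{W}}$ is not reproduced in this summary, so a weighted linear kernel stands in for it, and the regulariser `reg` is added purely for numerical stability. The names `kernel_complexity`, `randomise_labels`, and `weighted_linear_kernel` are assumptions of this sketch.

```python
import numpy as np

def kernel_complexity(K, y, reg=1.0):
    """Return y^T (K + reg * I)^{-1} y using a Cholesky factorisation."""
    L = np.linalg.cholesky(K + reg * np.eye(K.shape[0]))
    z = np.linalg.solve(L, y)      # solves L z = y
    return float(z @ z)            # ||L^{-1} y||^2 = y^T (L L^T)^{-1} y

def randomise_labels(y, ratio, rng):
    """Replace a `ratio` fraction of the +/-1 labels with fresh random signs."""
    y = y.copy()
    idx = rng.choice(y.size, size=int(round(ratio * y.size)), replace=False)
    y[idx] = rng.choice([-1.0, 1.0], size=idx.size)
    return y

def weighted_linear_kernel(X, weights):
    """Stand-in kernel (assumption): K = (X * w)(X * w)^T with per-coordinate weights w."""
    Z = X * weights
    return Z @ Z.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    y = np.sign(X[:, 0])                       # labels depend on coordinate 0 only
    w = np.array([1.0] + [0.1] * 9)            # weights concentrated on that coordinate
    K = weighted_linear_kernel(X, w)
    for ratio in (0.0, 0.5, 1.0):
        print(ratio, kernel_complexity(K, randomise_labels(y, ratio, rng)))
```

In this toy setting the kernel is aligned with the one informative coordinate, so clean labels give a lower complexity than fully randomised ones. That mirrors the qualitative relationship the figure examines, without reproducing the paper's kernel or its numbers.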

Theorems & Definitions (37)

  • Remark 4.6
  • Theorem 4.7: Bipartite graphon convergence
  • Remark 4.8
  • Remark 4.9
  • Theorem 5.1: UAT for dense 1-hidden-layer neural networks on $\mathbb{R}^k$ (Cybenko, 1989)
  • Remark 5.5
  • Lemma 5.6: Dense core in a graphon-sparsified mask
  • Theorem 5.7: Universality on active coordinates
  • Remark 5.8
  • Theorem 6.1: Infinite-width Graphon-NTK generalisation bound
  • ...and 27 more