Pruning at Initialisation through the lens of Graphon Limit: Convergence, Expressivity, and Generalisation

Hoang Pham; The-Anh Ta; Long Tran-Thanh

TL;DR

The paper studies how pruning at initialisation (PaI) shapes large sparse neural networks by formulating a graphon-limit theory for PaI masks under a Factorised Saliency Model. It proves that, as width grows, PaI masks converge in probability to deterministic bipartite graphons, enabling a topological taxonomy that separates unstructured pruning (constant graphons) from data-driven pruning (heterogeneous graphons). It then establishes a Universal Approximation Theorem on active coordinate subspaces and a Graphon-NTK continuation that yields a generalisation bound driven by path density through the limit graphon. Empirically, the theory is validated via visual convergence of finite-width masks to the predicted graphons and by showing that data-dependent pruning concentrates connectivity on informative coordinates, improving kernel alignment and generalisation under sparsity. These results recast sparse network analysis in terms of continuous operators, providing principled guidance for designing efficient PaI strategies with predictable expressivity and generalisation.

Abstract

Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finite-width statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. We introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.
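The contrast between homogeneous and heterogeneous limits can be illustrated with a small simulation. The sketch below is not the paper's Factorised Saliency Model, whose exact form is not reproduced on this page; it assumes a hypothetical saliency $|w_{ij}|\,a_i$ with a made-up per-input factor $a_i$, and compares it with plain magnitude pruning by averaging each mask over a coarse grid (a crude stand-in for the step-kernel view). The names `topk_mask`, `block_density`, and `a` are illustrative only.

```python
import numpy as np

def topk_mask(scores, density):
    """Keep the top `density` fraction of entries using one global threshold."""
    k = int(round(density * scores.size))
    thresh = np.partition(scores.ravel(), -k)[-k]  # k-th largest score
    return (scores >= thresh).astype(float)

def block_density(mask, blocks):
    """Average a d x n mask over a blocks x blocks grid (coarse step kernel)."""
    d, n = mask.shape
    rows = np.array_split(np.arange(d), blocks)
    cols = np.array_split(np.arange(n), blocks)
    return np.array([[mask[np.ix_(r, c)].mean() for c in cols] for r in rows])

rng = np.random.default_rng(0)
d, n, rho = 400, 400, 0.2
W = rng.standard_normal((d, n))  # i.i.d. Gaussian initial weights

# Magnitude pruning: scores depend only on the i.i.d. weights.
mag = topk_mask(np.abs(W), rho)

# Hypothetical data-dependent saliency: |w_ij| * a_i, where a_i is an
# assumed importance factor for input coordinate i (not from the paper).
a = np.linspace(0.1, 2.0, d)[:, None]
fact = topk_mask(np.abs(W) * a, rho)

dens_mag = block_density(mag, 4)
dens_fact = block_density(fact, 4)
print(dens_mag.round(2))   # roughly constant, close to rho
print(dens_fact.round(2))  # density increases with the input-coordinate block
```

Under these assumptions the magnitude mask has nearly uniform block densities, while the saliency-weighted mask concentrates its edges on the input coordinates with larger `a`, which is the qualitative distinction the taxonomy draws.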

Paper Structure (90 sections, 16 theorems, 144 equations, 11 figures, 1 table)

This paper contains 90 sections, 16 theorems, 144 equations, 11 figures, 1 table.

Introduction
Related works
Sparse neural networks.
Neural tangent kernels and generalisation error bounds.
Graphon limit of sparse neural networks.
Approximation theory for neural networks.
Preliminaries and settings
Preliminaries
Bipartite graphons and step kernels.
Cut norm and cut distance.
Setting and notations
Masks as bipartite graphs and step-kernel embedding.
Other notations.
Graphon convergence of PaI methods
The factorised saliency model
...and 75 more sections

Key Result

Theorem 4.7

Let $d\to\infty$ and $n\to\infty$. Consider the factorised saliency model with indices $i\in[d]$, $j\in[n]$, and let $W_n:=W_{M_n}$ be the associated bipartite step-kernel on $[0,1]^2$. Under Assumptions (growth) through (threshold), define the deterministic limit graphon $\mathcal{W}$. Then $\delta^{\mathrm{bip}}_\square(W_n,\mathcal{W})\xrightarrow{\mathbb{P}}0$.
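For readers unfamiliar with the objects in this statement, the following block recalls the usual graph-limit definitions. The step-kernel embedding and the cut norm are standard; the bipartite cut distance is written in its common form with separate measure-preserving maps $\varphi,\psi$ on the two coordinates, which is an assumption here and may differ in detail from the paper's own definition.

```latex
% Step-kernel embedding of a binary mask M \in \{0,1\}^{d \times n}
W_M(x,y) = M_{ij}
  \quad \text{for } (x,y) \in
  \Big(\tfrac{i-1}{d},\tfrac{i}{d}\Big] \times \Big(\tfrac{j-1}{n},\tfrac{j}{n}\Big]

% Cut norm of a kernel W on [0,1]^2 (S, T measurable)
\|W\|_\square = \sup_{S,T \subseteq [0,1]}
  \Big| \int_{S \times T} W(x,y)\,dx\,dy \Big|

% Bipartite cut distance (assumed form): rows and columns are relabelled
% independently by measure-preserving bijections \varphi, \psi of [0,1]
\delta^{\mathrm{bip}}_\square(U,W)
  = \inf_{\varphi,\psi} \big\| U - W^{\varphi,\psi} \big\|_\square,
  \qquad W^{\varphi,\psi}(x,y) = W(\varphi(x),\psi(y))
```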

Figures (11)

Figure 1: Visual convergence to the graphon limit. We compare averaged empirical masks (over 100 seeds) at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.2$.
Figure 2: Sensitivity of Graphon-NTK Complexity to Label Noise and Sparsity. Each panel plots the theoretical complexity $y^\top K_{\mathcal{W}}^{-1}y$ (y-axis) against the ratio of randomised labels (x-axis) for a specific density $\rho$.
Figure 3: Visual Convergence to the Graphon Limit. We compare empirical masks with different activation functions at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.1$.
Figure 4: Visual Convergence to the Graphon Limit. We compare empirical masks with different activation functions at increasing widths ($n=200, 500, 1000, 2000, 4000$) against the analytically computed Theoretical Graphon. The density is fixed at $\rho=0.2$.
Figure 5: Graphon Convergence of GraSP Variants. Empirical pruning masks for GraSP at increasing widths ($n \in \{200, \dots, 4000\}$) with density $\rho=20\%$. We compare Magnitude GraSP (rows 1, 3) and Signed GraSP (rows 2, 4) for Tanh and Sigmoid activations.
...and 6 more figures
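The complexity term $y^\top K^{-1}y$ plotted in Figure 2 can be computed for any positive-definite kernel matrix. The sketch below is only an illustration of that quantity under label randomisation: it uses a Gaussian (RBF) kernel as a stand-in, not the paper's Graphon-NTK $K_{\mathcal{W}}$, and the data, the small ridge term, and the helper names `kernel_complexity` and `randomise` are all assumptions made for the example.

```python
import numpy as np

def kernel_complexity(K, y, ridge=1e-6):
    """Return y^T (K + ridge * I)^{-1} y using a linear solve.

    The ridge is a numerical stabiliser added for this sketch only.
    """
    n = K.shape[0]
    return float(y @ np.linalg.solve(K + ridge * np.eye(n), y))

def randomise(y, ratio, rng):
    """Replace a `ratio` fraction of labels with uniform random +/-1 labels."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(ratio * len(y)), replace=False)
    y[idx] = rng.choice([-1.0, 1.0], size=len(idx))
    return y

rng = np.random.default_rng(0)
n, dim = 200, 5
X = rng.standard_normal((n, dim))
y_clean = np.sign(X[:, 0])  # labels depend on one informative coordinate

# Stand-in kernel (assumption): Gaussian RBF, not the Graphon-NTK.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2.0 * dim))

results = {r: kernel_complexity(K, randomise(y_clean, r, rng))
           for r in (0.0, 0.5, 1.0)}
for ratio, value in results.items():
    print(f"label-noise ratio {ratio:.1f}: complexity {value:.3e}")
```

One would typically expect the value to grow as more labels are randomised, since random labels align poorly with the leading eigenvectors of the kernel, but the exact numbers depend on the kernel, the ridge, and the seed.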

Theorems & Definitions (37)

Remark 4.6
Theorem 4.7: Bipartite graphon convergence
Remark 4.8
Remark 4.9
Theorem 5.1: UAT for dense 1-hidden-layer neural networks on $\mathbb{R}^k$ (Cybenko, 1989)
Remark 5.5
Lemma 5.6: Dense core in a graphon-sparsified mask
Theorem 5.7: Universality on active coordinates
Remark 5.8
Theorem 6.1: Infinite-width Graphon-NTK generalisation bound
...and 27 more
