Structured vs. Unstructured Pruning: An Exponential Gap

Davide Ferre'; Frédéric Giroire; Frederik Mallmann-Trenn; Emanuele Natale

TL;DR

Neuron pruning requires a starting network with $\Omega(d/\varepsilon)$ hidden neurons to $\varepsilon$-approximate a target ReLU neuron, whereas weight pruning achieves $\varepsilon$-approximation with only $O(d\log(1/\varepsilon))$ neurons, establishing an exponential separation between the two pruning paradigms.
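To get a feel for the size of this gap, the two bounds can be evaluated numerically. The sketch below is only an illustration: it sets the constants hidden in $\Omega(\cdot)$ and $O(\cdot)$ to 1, which the source does not specify, and the function names are hypothetical. Writing $L = \log(1/\varepsilon)$, the neuron-pruning bound scales as $d\,e^{L}$ while the weight-pruning bound scales as $d\,L$, which is the sense in which the separation is exponential.

```python
import math


def neuron_pruning_width(d: int, eps: float) -> float:
    """Order of the lower bound d / eps (hidden constant assumed to be 1)."""
    return d / eps


def weight_pruning_width(d: int, eps: float) -> float:
    """Order of the upper bound d * log(1 / eps) (hidden constant assumed to be 1)."""
    return d * math.log(1.0 / eps)


d = 10  # illustrative input dimension, not taken from the paper
for eps in (1e-1, 1e-2, 1e-3):
    structured = neuron_pruning_width(d, eps)
    unstructured = weight_pruning_width(d, eps)
    print(f"eps={eps:g}  d/eps={structured:.0f}  d*log(1/eps)={unstructured:.1f}")
```

Because the true constants are unknown here, only the growth rates are meaningful, not the absolute numbers.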

Abstract

The Strong Lottery Ticket Hypothesis (SLTH) posits that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as either unstructured, where individual weights can be removed from the network, or structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention. In this work, we consider the problem of approximating a single bias-free ReLU neuron using a randomly initialized bias-free two-layer ReLU network, thereby isolating the intrinsic limitations of neuron pruning. We show that neuron pruning requires a starting network with $\Omega(d/\varepsilon)$ hidden neurons to $\varepsilon$-approximate a target ReLU neuron. In contrast, weight pruning achieves $\varepsilon$-approximation with only $O(d\log(1/\varepsilon))$ neurons, establishing an exponential separation between the two pruning paradigms.
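The difference between the two paradigms can be made concrete on a tiny bias-free two-layer ReLU network. The following is a minimal sketch, not the paper's construction: the helper names (`forward`, `neuron_prune`, `weight_prune`), the masks, and the choice to mask both layers in the unstructured case are assumptions made for illustration. Weights are drawn from standard Gaussians, as in the random initialization the abstract describes.

```python
import random


def relu(z: float) -> float:
    return max(0.0, z)


def forward(W, alpha, x):
    """Bias-free two-layer ReLU network: sum_i alpha_i * relu(<w_i, x>)."""
    return sum(a * relu(sum(wj * xj for wj, xj in zip(w, x)))
               for w, a in zip(W, alpha))


def neuron_prune(W, alpha, keep):
    """Structured pruning: keep only the hidden units whose index is in `keep`."""
    idx = sorted(keep)
    return [W[i] for i in idx], [alpha[i] for i in idx]


def weight_prune(W, alpha, mask_W, mask_alpha):
    """Unstructured pruning: zero out individual weights via binary masks."""
    W_p = [[w * m for w, m in zip(row, mrow)] for row, mrow in zip(W, mask_W)]
    a_p = [a * m for a, m in zip(alpha, mask_alpha)]
    return W_p, a_p


rng = random.Random(0)
d, n_hidden = 2, 5  # small illustrative sizes
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_hidden)]
alpha = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
x = [0.5, -1.0]

# Neuron pruning: whole hidden units are kept or dropped.
W_s, alpha_s = neuron_prune(W, alpha, keep={0, 3})

# Weight pruning: any individual entry may be removed (hypothetical masks).
mask_W = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]]
mask_alpha = [1, 1, 0, 1, 1]
W_m, alpha_m = weight_prune(W, alpha, mask_W, mask_alpha)

print(forward(W_s, alpha_s, x), forward(W_m, alpha_m, x))
```

Neuron pruning corresponds to the special case of weight pruning in which the mask removes entire hidden units, so unstructured pruning can always reproduce any neuron-pruned subnetwork but not vice versa.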

Paper Structure (18 sections, 16 theorems, 32 equations, 3 figures)

Introduction
Unstructured pruning and logarithmic overparameterization.
Structured pruning and neuron pruning.
Our Contribution
Related Work
Preliminaries and Setup
Main Result
Proof of Theorem 1
Union bound over all pruned subnetworks
Restriction over simple input families
Breakpoints and necessary conditions for approximation
Construction of a dominating capped process
Construction of a dominating birth-death process
Back to the union bound
Conclusion and Future Work
...and 3 more sections

Key Result

Theorem 1

Let $d \ge 2$, $\varepsilon \in (0,1)$, and let $\mathbf{w^\star} \in \mathbb{R}^d$ with $\|\mathbf{w^\star}\|_2 = 1$. Consider a one hidden-layer ReLU network without bias of the form $g(\mathbf{x}) = \sum_{i=1}^{N_h} \alpha_i\, \sigma(\langle \mathbf{w_i}, \mathbf{x} \rangle)$, where the weights $\{\mathbf{w_i}\}_{i=1}^{N_h}$ are drawn independently from $\mathcal{N}(0,I_d)$, the coefficients $\{\alpha_i\}_{i=1}^{N_h}$ are drawn independently from $\mathcal{N}(0,1)$, and $\mathbf{x} \in$ …

Figures (3)

Figure 1: A target ReLU neuron $f$, a random network $g$ with $N_h=5$ hidden neurons, and a network $g_S$ obtained from $g$ through neuron pruning, by only keeping a subset $S$ of hidden units. The input dimension is $d=2$, and each hidden unit $i$ in $g$ has incoming weights $\mathbf{w}_{i} \in \mathbb{R}^2$ (not written) and an output coefficient $\alpha_i \in \mathbb{R}$.
Figure 2: Stacked state-line representation of the birth-death chain $(B^{\mathrm{bd}}_s)_{s\ge 0}$. Each horizontal line is the state space $\{0,1,\dots,T\}$ for one input family, and the filled dot marks the current value of $B^{\mathrm{bd}}_s$. A necessary condition for $\varepsilon$-approximation is that all $\lfloor d/2\rfloor$ independent chains reach state $0$.
Figure 3: Breakpoint alignment intuition along a one-dimensional input family $\mathbf{x}_i(t)$. Along $\mathbf{x}_i(t)$, a bias-free ReLU neuron $\sigma(\langle w,x\rangle)$ reduces to the one-dimensional function $t \mapsto \sigma(w_{2i-1}t + w_{2i})$, whose breakpoint is $t_i(w) = -w_{2i}/w_{2i-1}$. The target (red) has breakpoint $t_i^\star$, while a randomly drawn neuron (cyan) has breakpoint $t_{i,j}$. If $|t_{i,j}-t_i^\star|>\varepsilon$, then on the $\varepsilon$-neighborhood of $t_i^\star$ the cyan function is linear (in the picture, flat) whereas the target changes slope, yielding a nontrivial uniform error, as stated in Lemma 2.
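The breakpoint formula in the Figure 3 caption can be checked with a short sketch. The code below assumes that the input family $\mathbf{x}_i(t)$ places $t$ in coordinate $2i-1$ and $1$ in coordinate $2i$, with zeros elsewhere; this is inferred from the reduced form $\sigma(w_{2i-1}t + w_{2i})$ and is not stated explicitly in the caption. All function names and the example vectors are hypothetical, and the alignment test only mirrors the caption's intuition, not the precise statement of Lemma 2.

```python
def input_family(d, i, t):
    """Assumed form of x_i(t): t at coordinate 2i-1, 1 at coordinate 2i (1-based)."""
    x = [0.0] * d
    x[2 * i - 2] = t    # coordinate 2i-1 in 0-based storage
    x[2 * i - 1] = 1.0  # coordinate 2i in 0-based storage
    return x


def restricted_neuron(w, i, t):
    """The neuron along the family: relu(w_{2i-1} * t + w_{2i})."""
    return max(0.0, w[2 * i - 2] * t + w[2 * i - 1])


def breakpoint_location(w, i):
    """t_i(w) = -w_{2i} / w_{2i-1}; None when the slope coefficient is zero."""
    slope, offset = w[2 * i - 2], w[2 * i - 1]
    if slope == 0.0:
        return None  # no kink along this family
    return -offset / slope


def is_aligned(w, w_star, i, eps):
    """True if the neuron's breakpoint lies within eps of the target's."""
    t, t_star = breakpoint_location(w, i), breakpoint_location(w_star, i)
    if t is None or t_star is None:
        return False
    return abs(t - t_star) <= eps


w_star = [0.8, -0.6, 0.0, 0.0]  # illustrative unit-norm target, d = 4
w_far = [1.0, -0.5, 0.3, 0.2]   # breakpoint at 0.5 along family i = 1
w_near = [2.0, -1.5, 0.1, 0.4]  # breakpoint at 0.75 along family i = 1

print(is_aligned(w_far, w_star, i=1, eps=0.1))
print(is_aligned(w_near, w_star, i=1, eps=0.1))
```

Since a bias-free ReLU neuron is positively homogeneous, rescaling `w` by a positive factor leaves the breakpoint location unchanged, which is why `w_near` aligns with the target despite having a different norm.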

Theorems & Definitions (26)

Definition 1: $\varepsilon$-approximation
Theorem 1: Lower bound for neuron pruning
Definition 2: Broken bin
Lemma 1: Broken bin prevents approximation
Lemma 2: A breakpoint is necessary for approximation
Definition 3
Lemma 3
Lemma 4
Definition 4
Lemma 5
...and 16 more
