Theoretical Compression Bounds for Wide Multilayer Perceptrons

Houssam El Cheairi, David Gamarnik, Rahul Mazumder

TL;DR

The paper develops a rigorous, data-agnostic theory for post-training compression of wide neural networks via a randomized greedy pruning/quantization scheme. By leveraging second-order loss approximations and a Lindeberg interpolation framework, it proves the existence of pruned and/or quantized subnetworks for both unstructured and structured pruning, extending from MLPs to CNNs through a convolution-to-MLP representation. The main contributions include quantitative bounds on the excess loss, explicit sparsity/quantization levels, and width/bottleneck conditions that enable linear sparsity at scale, supported by numerical simulations. Overall, the work provides a principled justification for the empirical success of compression in wide networks and clarifies how architectural properties influence compressibility.

Abstract

Pruning and quantization techniques have been broadly successful in reducing the number of parameters needed for large neural networks, yet theoretical justification for their empirical success falls short. We consider a randomized greedy compression algorithm for pruning and quantization post-training and use it to rigorously show the existence of pruned/quantized subnetworks of multilayer perceptrons (MLPs) with competitive performance. We further extend our results to structured pruning of MLPs and convolutional neural networks (CNNs), thus providing a unified analysis of pruning in wide networks. Our results are free of data assumptions, and showcase a tradeoff between compressibility and network width. The algorithm we consider bears some similarities with Optimal Brain Damage (OBD) and can be viewed as a post-training randomized version of it. The theoretical results we derive bridge the gap between theory and application for pruning/quantization, and provide a justification for the empirical success of compression in wide multilayer perceptrons.
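For context on the OBD connection mentioned above, classical OBD ranks weights by a second-order estimate of the loss increase caused by deleting them. The following is a sketch of that classical argument only, not of the paper's randomized variant; the symbols $E$ (training loss), $g_i$ (gradient entries), $h_{ij}$ (Hessian entries) and $s_i$ (saliency) are notation assumed here for illustration.

$$\delta E = \sum_i g_i\,\delta w_i + \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2 + \frac{1}{2}\sum_{i\neq j} h_{ij}\,\delta w_i\,\delta w_j + O(\|\delta \mathbf{w}\|^3)$$

At a trained minimum the gradient term is approximately zero. Dropping the off-diagonal terms and setting $\delta w_i = -w_i$ (deleting weight $i$) gives the saliency $s_i = \tfrac{1}{2} h_{ii} w_i^2$, and weights with the smallest saliency are pruned first.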

Paper Structure

This paper contains 33 sections, 24 theorems, 185 equations, 5 figures, 1 algorithm.

Key Result

Proposition 1

Suppose conditions 1.1 through 1.5 of Assumption 1 hold. Let $\mathcal{R}$ be a distribution over $\mathbb{B}^{n_1}_{2}(1)$ and $\mathbf{x} \sim \mathcal{R}$. Given $\xi\in(0,1)$ and $p\in (0,1)$, there exist positive constants $\delta=\delta(\xi)$ and $n$ [...] with $\alpha\approx 0.99$, then there exists a network $\hat{\Phi}$ given by $\hat{\Phi}(\mathbf{x})$ [...]

Figures (5)

  • Figure 1: Evaluation plots for unstructured pruning on the California Housing dataset using an MLP model $\Phi_w(\mathbf{x}) = \mathbf{W}_3{\rm ReLU}(\mathbf{W}_2 {\rm ReLU}(\mathbf{W}_1 \mathbf{x}))$ with $\mathbf{W}_1\in \mathbb{R}^{w\times 8}, \mathbf{W}_2\in \mathbb{R}^{40\times w}, \mathbf{W}_3\in \mathbb{R}^{1\times 40}$.
  • Figure 2: Evaluation plots for unstructured pruning on the Digits dataset using an MLP model $\Phi_w(\mathbf{x}) = {\rm Softmax}(\mathbf{W}_3{\rm ReLU}(\mathbf{W}_2 {\rm ReLU}(\mathbf{W}_1 \mathbf{x})))$ with $\mathbf{W}_1\in \mathbb{R}^{w\times 64}, \mathbf{W}_2\in \mathbb{R}^{40\times w}, \mathbf{W}_3\in \mathbb{R}^{10\times 40}$.
  • Figure 3: Evaluation plots for structured pruning on the California Housing dataset using an MLP model $\Phi_w(\mathbf{x}) = \mathbf{W}_3{\rm ReLU}(\mathbf{W}_2 {\rm ReLU}(\mathbf{W}_1 \mathbf{x}))$ with $\mathbf{W}_1\in \mathbb{R}^{w\times 8}, \mathbf{W}_2\in \mathbb{R}^{20\times w}, \mathbf{W}_3\in \mathbb{R}^{1\times 20}$.
  • Figure 4: Evaluation plots for structured pruning on the Digits dataset using an MLP model $\Phi_w(\mathbf{x}) = {\rm Softmax}(\mathbf{W}_3{\rm ReLU}(\mathbf{W}_2 {\rm ReLU}(\mathbf{W}_1 \mathbf{x})))$ with $\mathbf{W}_1\in \mathbb{R}^{w\times 64}, \mathbf{W}_2\in \mathbb{R}^{20\times w}, \mathbf{W}_3\in \mathbb{R}^{10\times 20}$.
  • Figure 5: Evaluation plots for structured pruning on the Digits dataset using a CNN model $\Phi_w(\mathbf{x}) = {\rm Softmax}(\mathbf{W}_{\rm fc}{\rm ReLU}(\mathbf{K}_2 {\rm ReLU}(\mathbf{K}_1 \mathbf{x})))$ with $\mathbf{K}_1\in \mathbb{R}^{w\times 1\times 3\times 3}, \mathbf{K}_2\in \mathbb{R}^{16 \times w \times 3 \times 3}, \mathbf{W}_{\rm fc}\in \mathbb{R}^{10\times 1024}$.
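To make the difference between the unstructured and structured settings in the captions above concrete, here is a minimal pure-Python sketch using the layer shapes from the Figure 1 caption ($\mathbf{W}_1\in \mathbb{R}^{w\times 8}$, $\mathbf{W}_2\in \mathbb{R}^{40\times w}$, $\mathbf{W}_3\in \mathbb{R}^{1\times 40}$). It is an illustration under stated assumptions, not the paper's algorithm: the weights are random rather than trained, the masks are drawn independently with a hypothetical keep probability `keep_prob`, and all function names are invented for this example.

```python
import random

def relu(v):
    return [max(a, 0.0) for a in v]

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def random_matrix(rows, cols, rng):
    # Gaussian entries scaled by 1/sqrt(fan-in); an assumption, not trained weights.
    scale = cols ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_mlp(w, rng, n_in=8, n_hidden=40, n_out=1):
    # Shapes follow the Figure 1 caption: W1 is w x 8, W2 is 40 x w, W3 is 1 x 40.
    return (random_matrix(w, n_in, rng),
            random_matrix(n_hidden, w, rng),
            random_matrix(n_out, n_hidden, rng))

def forward(weights, x):
    W1, W2, W3 = weights
    return matvec(W3, relu(matvec(W2, relu(matvec(W1, x)))))

def prune_unstructured(W, keep_prob, rng):
    # Unstructured: zero individual entries; the matrix keeps its shape.
    return [[a if rng.random() < keep_prob else 0.0 for a in row] for row in W]

def prune_structured(W1, W2, keep_prob, rng):
    # Structured: drop whole hidden units, i.e. rows of W1 and the
    # matching columns of W2, so both matrices actually shrink.
    keep = [i for i in range(len(W1)) if rng.random() < keep_prob]
    return [W1[i] for i in keep], [[row[i] for i in keep] for row in W2]

rng = random.Random(0)
W1, W2, W3 = init_mlp(64, rng)

# Draw an input and rescale it into the unit Euclidean ball.
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
norm = sum(a * a for a in x) ** 0.5
x = [a / max(1.0, norm) for a in x]

dense_out = forward((W1, W2, W3), x)

sparse_W1 = prune_unstructured(W1, 0.5, rng)
sparse_out = forward((sparse_W1, W2, W3), x)

small_W1, small_W2 = prune_structured(W1, W2, 0.5, rng)
small_out = forward((small_W1, small_W2, W3), x)

print(len(W1), "hidden units before,", len(small_W1), "after structured pruning")
```

Unstructured pruning leaves the layer dimensions unchanged and only introduces zeros, whereas structured pruning reduces the effective width $w$ itself, which is why the two cases are evaluated separately in the figures.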

Theorems & Definitions (48)

  • Definition 1: Gate Matrix
  • Proposition 1
  • Proposition 2
  • Remark 1
  • Corollary 1
  • Proposition 3
  • Proposition 4
  • Lemma 1
  • Proof
  • Lemma 2
  • ...and 38 more