Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization

Leonid Berlyand, Theo Bourdais, Houman Owhadi, Yitzchak Shmalo

TL;DR

The paper tackles the problem of pruning deep neural networks, especially Vision Transformers, by leveraging Random Matrix Theory and the Marchenko-Pastur distribution to separate noise from signal in weight matrices and singular vectors. It introduces a principled MP-based pruning workflow with metrics gamma and mu, achieving substantial parameter reductions (30-50%) with minimal accuracy loss on ImageNet and providing rigorous theory showing that training reduces randomness and that removing randomness lowers loss. The contributions include both practical pruning algorithms and generalized theoretical results (Gaussian and non-Gaussian settings) that connect regularization, spectral properties, and performance. The approach yields a data-free pruning strategy with broad applicability and offers insights into how regularization interacts with pruning to drive networks toward low-rank, information-rich representations.

Abstract

Deep neural networks (DNNs) have brought significant advancements in recent years to applications such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework for analyzing the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can reduce the number of parameters of ViT models (trained on ImageNet) by 30-50% with less than 1% loss in accuracy. To our knowledge, this represents the state of the art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of these numerical studies: we prove a theorem for fully connected DNNs, and for other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum during training. We verify this theorem through numerical experiments on fully connected DNNs. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show that the decrease in loss depends monotonically on the amount of randomness removed. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.
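
To make the idea of separating noise from signal in a weight matrix concrete, here is a minimal sketch in Python with NumPy. It is not the authors' pruning algorithm and does not reproduce their gamma and mu metrics. It illustrates only the generic technique of discarding singular values that fall below the Marchenko-Pastur noise edge. The helper name `mp_truncate`, the `margin` safety factor, the assumption that the noise level `sigma` is known, and the synthetic "planted low-rank plus Gaussian noise" matrix are all hypothetical choices made for illustration.

```python
import numpy as np

def mp_truncate(W, sigma, margin=1.05):
    """Drop singular values of W below the Marchenko-Pastur noise edge.

    Assumption (illustrative): the noise entries are i.i.d. with known
    standard deviation `sigma`, so the largest singular value of pure
    noise is close to sigma * (sqrt(N) + sqrt(M)). `margin` is a
    hypothetical safety factor that absorbs finite-size fluctuations.
    """
    N, M = W.shape
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    edge = margin * sigma * (np.sqrt(N) + np.sqrt(M))
    keep = s > edge  # singular values attributed to signal
    W_low = (U[:, keep] * s[keep]) @ Vt[keep, :]
    return W_low, int(keep.sum())

# Synthetic test matrix: planted rank-3 signal plus Gaussian noise.
rng = np.random.default_rng(0)
N, M, rank, sigma = 400, 200, 3, 1.0
U0, _ = np.linalg.qr(rng.normal(size=(N, rank)))  # orthonormal columns
V0, _ = np.linalg.qr(rng.normal(size=(M, rank)))
signal = (U0 * np.array([120.0, 90.0, 60.0])) @ V0.T
W = signal + sigma * rng.normal(size=(N, M))

W_low, kept = mp_truncate(W, sigma)
err_before = np.linalg.norm(W - signal)    # Frobenius error of noisy matrix
err_after = np.linalg.norm(W_low - signal) # error after truncation
print("singular values kept:", kept)
print("error before / after:", err_before, err_after)
```

In this synthetic setup the truncated matrix should be closer to the planted signal than the noisy one, and storing it in factored form takes `kept * (N + M)` numbers rather than `N * M`. For a real trained layer the noise level is not known in advance and would have to be estimated, which this sketch does not attempt.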

Paper Structure

This paper contains 59 sections, 18 theorems, 119 equations, 21 figures, 5 tables, 5 algorithms.

Key Result

Theorem 2.2

Consider an $N\times M$ random matrix $W$ with $M \leq N$. Let the entries $W_{i,j}$ be independent, identically distributed with zero mean and finite variance $\sigma^2$. Let $X = \frac{1}{N} W^T W$. When $N \to \infty$ and $\frac{M}{N} \to c \in (0,+\infty)$, the ESD of $X$, denoted by $\mu_{X_M}$ with
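
As a numerical illustration of the setting in Theorem 2.2, the following minimal sketch samples a random $N\times M$ matrix $W$, forms $X = \frac{1}{N} W^T W$, and compares its eigenvalues with the Marchenko-Pastur law. The theorem statement above is cut off, so the density and support edges used here, $\lambda_\pm = \sigma^2(1 \pm \sqrt{c})^2$, are taken from the standard textbook form of the law rather than from the text above. The function names `mp_edges` and `mp_density`, the Gaussian entries, and the matrix sizes are illustrative assumptions.

```python
import numpy as np

def mp_edges(c, sigma2=1.0):
    """Support edges of the Marchenko-Pastur law for ratio c = M/N <= 1
    (standard form; assumed here, not quoted from the text above)."""
    lam_minus = sigma2 * (1.0 - np.sqrt(c)) ** 2
    lam_plus = sigma2 * (1.0 + np.sqrt(c)) ** 2
    return lam_minus, lam_plus

def mp_density(x, c, sigma2=1.0):
    """Marchenko-Pastur density evaluated at x (zero outside the support)."""
    lam_minus, lam_plus = mp_edges(c, sigma2)
    x = np.asarray(x, dtype=float)
    inside = (x > lam_minus) & (x < lam_plus)
    rho = np.zeros_like(x)
    xi = x[inside]
    rho[inside] = np.sqrt((lam_plus - xi) * (xi - lam_minus)) / (
        2.0 * np.pi * sigma2 * c * xi
    )
    return rho

# Sample W with i.i.d. zero-mean entries of variance sigma2 (Gaussian here).
rng = np.random.default_rng(0)
N, M, sigma2 = 2000, 500, 1.0
W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
X = W.T @ W / N                      # M x M matrix from the theorem
eigs = np.linalg.eigvalsh(X)         # real eigenvalues, ascending order

lam_minus, lam_plus = mp_edges(M / N, sigma2)
print("empirical eigenvalue range:", eigs.min(), eigs.max())
print("Marchenko-Pastur edges:    ", lam_minus, lam_plus)
```

For large $N$ the empirical eigenvalues should fall close to the interval $[\lambda_-, \lambda_+]$, and a histogram of `eigs` can be compared against `mp_density` on a grid. Eigenvalues of a trained weight matrix that lie well outside this interval are the ones an RMT-based method would treat as signal rather than noise.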

Figures (21)

  • Figure 1: Top-1 and top-5 accuracy of the ViT-base model vs. percentage of parameters kept when pruning via RMT-based sparsification without fine-tuning. The accuracy of the DNN is given as a percentage above the data points.
  • Figure 2: Top-1 and top-5 accuracy of the ViT-large model vs. percentage of parameters kept when pruning via RMT-based sparsification without fine-tuning. The accuracy of the DNN is given as a percentage above the data points.
  • Figure 3: Comparison of pruning results between our method and CP-ViT [song2022cp].
  • Figure 4: DNN performance vs. sparsification
  • Figure 5: Accuracy and loss as we add noise to the weight layers of the DNN.
  • ...and 16 more figures

Theorems & Definitions (40)

  • Definition 2.1
  • Theorem 2.2: Marchenko and Pastur (1967)
  • Remark 3.1
  • Remark 4.1
  • Remark 4.2
  • Theorem 4.1
  • Theorem 4.2
  • Corollary 4.3
  • Lemma 4.4
  • Example 5.1
  • ...and 30 more