"Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach

Lingyu Gu, Yongqi Du, Yuan Zhang, Di Xie, Shiliang Pu, Robert C. Qiu, Zhenyu Liao

TL;DR

This paper demonstrates that, in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists an asymptotic spectral equivalence between the NTK matrices for a large family of DNN models.

Abstract

Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the price of increased depth and having more parameters per layer, making their training and inference more computationally challenging. In an attempt to address this key limitation, efforts have been devoted to the compression (e.g., sparsification and/or quantization) of these large-scale machine learning models, so that they can be deployed on low-power IoT devices. In this paper, building upon recent advances in neural tangent kernel (NTK) and random matrix theory (RMT), we provide a novel compression approach to wide and fully-connected \emph{deep} neural nets. Specifically, we demonstrate that in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists \emph{asymptotic spectral equivalence} between the NTK matrices for a large family of DNN models. This theoretical result enables "lossless" compression of a given DNN to be performed, in the sense that the compressed network yields asymptotically the same NTK as the original (dense and unquantized) network, with its weights and activations taking values \emph{only} in $\{ 0, \pm 1 \}$ up to a scaling. Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme, with code available at \url{https://github.com/Model-Compression/Lossless_Compression}.
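The flavor of this claim can be probed numerically on a toy case. The sketch below is not the paper's compression scheme (which also quantizes activations; see the linked repository). It only compares the empirical conjugate kernel of one wide ReLU layer under dense Gaussian weights and under sparse weights in $\{0, \pm 1\}$ up to a scaling. The function names, the single-Gaussian data, the layer width, and the scaling $1/\sqrt{1-\varepsilon}$ (chosen to give unit variance) are all assumptions made for illustration.

```python
import numpy as np


def ternary_weights(rng, shape, sparsity):
    """Draw i.i.d. weights in {0, +s, -s} with zero mean and unit variance.

    `sparsity` is the probability of a zero entry; s = 1/sqrt(1 - sparsity)
    is the scaling that restores unit variance (an illustrative choice).
    """
    signs = rng.choice([-1.0, 1.0], size=shape)
    mask = rng.random(shape) >= sparsity  # keep an entry with prob. 1 - sparsity
    return signs * mask / np.sqrt(1.0 - sparsity)


def empirical_ck(weights, data):
    """One-hidden-layer ReLU kernel estimate: relu(WX)^T relu(WX) / width."""
    features = np.maximum(weights @ data, 0.0)
    return features.T @ features / weights.shape[0]


def relative_ck_gap(seed=0, p=400, n=100, width=4000, sparsity=0.5):
    """Relative spectral-norm gap between dense and ternary kernel estimates."""
    rng = np.random.default_rng(seed)
    # Toy data (assumption): a single Gaussian, columns of norm close to 1.
    data = rng.standard_normal((p, n)) / np.sqrt(p)
    k_dense = empirical_ck(rng.standard_normal((width, p)), data)
    k_sparse = empirical_ck(ternary_weights(rng, (width, p), sparsity), data)
    # ord=2 on a 2-D array is the spectral norm (largest singular value).
    return np.linalg.norm(k_dense - k_sparse, 2) / np.linalg.norm(k_dense, 2)


if __name__ == "__main__":
    print(f"relative spectral-norm gap: {relative_ck_gap():.3f}")
```

The reported gap mixes finite-width sampling noise with any genuine difference between the two weight distributions, so it should be read as a sanity check rather than a verification of the theorem.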
Paper Structure (27 sections, 6 theorems, 121 equations, 4 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

Let Assumptions ass:high-dimen--ass:activation hold, and let $\tau_0, \tau_1, \ldots, \tau_L \geq 0$ be a sequence of non-negative numbers satisfying the recursion stated in the paper. Further assume that the activation functions $\sigma_\ell(\cdot)$ are "centered," such that $\mathbb{E}[\sigma_\ell(\tau_{\ell-1} \xi)] = 0$. Then, for the CK matrix $\mathbf{K}_{\mathrm{CK},\ell}$ of la
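The centering condition $\mathbb{E}[\sigma_\ell(\tau_{\ell-1} \xi)] = 0$ can be met for a given activation by subtracting its Gaussian mean. The sketch below assumes $\xi \sim \mathcal{N}(0, 1)$ (the excerpt above does not define $\xi$) and estimates the mean by Monte Carlo; the helper names `gaussian_mean` and `center` are hypothetical, and the paper may center its activations differently. For ReLU the mean has the closed form $\tau / \sqrt{2\pi}$, which serves as a check.

```python
import numpy as np


def gaussian_mean(activation, tau, num_samples=1_000_000, seed=0):
    """Monte Carlo estimate of E[activation(tau * xi)] with xi ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    return float(np.mean(activation(tau * rng.standard_normal(num_samples))))


def center(activation, tau, **kwargs):
    """Return t -> activation(t) - E[activation(tau * xi)], a centered version."""
    shift = gaussian_mean(activation, tau, **kwargs)
    return lambda t: activation(t) - shift


def relu(t):
    return np.maximum(t, 0.0)


if __name__ == "__main__":
    tau = 1.5
    # Monte Carlo estimate next to the closed form tau / sqrt(2 * pi).
    print(gaussian_mean(relu, tau), tau / np.sqrt(2 * np.pi))
    # Mean of the centered activation on fresh samples: close to zero.
    print(gaussian_mean(center(relu, tau), tau, seed=1))
```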

Figures (4)

  • Figure 1: Visual representations of activations $\sigma_T$ and $\sigma_Q$ in \ref{eq:def_sigmal} (left) and the expressions of $\mathbb{E}[\sigma_T(\tau\xi)]$ and $\mathbb{E}[\sigma_Q(\tau\xi)]$ (right), with $r_1 - r_2 = r_3 - r_4$ here and $\mathrm{erf}(\cdot)$ the Gaussian error function.
  • Figure 2: Eigenvalue histograms (top) and dominant eigenvectors (bottom) of last-layer CK matrices $\mathbf{K}_{\mathrm{CK}}$ (blue) defined in \ref{eq:def_K_CK_K_NTK} (with expectation estimated from $1\,000$ independent realizations of $\mathbf{W}$) and the asymptotic equivalent $\tilde{\mathbf{K}}_{\mathrm{CK}}$ (red) matrices. (Left) Gaussian $\mathbf{W}$ on two-class GMM data, with $p=2\,000$, $n=8\,000$, $\boldsymbol{\mu}_a=[\mathbf{0}_{8(a-1)};~8;~\mathbf{0}_{p-8a+7}]$, $\mathbf{C}_a=(1+8(a-1)/\sqrt{p})\mathbf{I}_p$, $a \in \{ 1,2\}$, using $[\mathrm{ReLU},~\mathrm{ReLU},~\mathrm{ReLU}]$ activations, where $\| \mathbf{K}_{\mathrm{CK}} - \tilde{\mathbf{K}}_{\mathrm{CK}} \| = 0.15$; and (right) symmetric Bernoulli $\mathbf{W}$ on MNIST data (number $6$ versus $8$) \cite{lecun1998gradient}, with $p=784$, $n=3\,200$, using $[\mathrm{poly},~\mathrm{ReLU},~\mathrm{ReLU}]$ activations, where $\| \mathbf{K}_{\mathrm{CK}} - \tilde{\mathbf{K}}_{\mathrm{CK}} \| = 6.86$. $\mathbf{x}_1, \ldots, \mathbf{x}_{n/2} \in \mathcal{C}_1$ and $\mathbf{x}_{n/2+1}, \ldots, \mathbf{x}_n \in \mathcal{C}_2$ in both cases.
  • Figure 3: Classification accuracies of different compressed fully-connected nets on MNIST \cite{lecun1998gradient} (top) and CIFAR10 \cite{krizhevsky2009learning} (bottom) datasets. Blue curves represent the proposed compression approach with different levels of sparsity $\varepsilon \in \{ 0\%, 50\%, 90\% \}$, purple curves represent the heuristic sparsification approach by uniformly zeroing out $80\%$ of the weights, green curves represent the heuristic quantization approach using the binary activation $\sigma (t) = 1_{t < -1}+ 1_{t > 1}$, red curves represent the original network, brown curves represent the proposed compression approach without activation quantization, with $\varepsilon=90\%$ for MNIST (top) and $\varepsilon=95\%$ for CIFAR10 (bottom), and orange curves represent magnitude-based pruning \cite{gale2019state} with the same sparsity level $\varepsilon$ as brown. Memory varies due to the change of layer width of the network.
  • Figure 4: Test accuracy of classification on the 2-class MNIST dataset (top, digits $6$ versus $8$) and the 5-class MNIST dataset (bottom, digits $(0,1,2,3,4)$). Blue curves represent the proposed "lossless" compression scheme with different levels of sparsity $\varepsilon \in \{ 0\%, 50\%, 90\% \}$, purple curves represent the heuristic sparsification approach by uniformly zeroing out $90\%$ of the weights, green curves represent the heuristic quantization approach using the binary activation $\sigma (t) = 1_{t < -1}+ 1_{t > 1}$ (only applied on the first two layers, otherwise the performance is too poor to be compared to other curves), and red curves represent the original (dense and unquantized) network. All nets have three fully-connected layers, and the original network uses $\mathrm{ReLU}$ activations for all layers. Memory varies due to the change of layer width of the network.
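The synthetic setting of Figure 2 (left) and the heuristic binary activation of Figures 3 and 4 can be written down directly from the captions. The sketch below samples $\mathbf{x}_i \sim \mathcal{N}(\boldsymbol{\mu}_a, \mathbf{C}_a)$ with the stated means and covariances and orders the samples so that the first $n/2$ belong to $\mathcal{C}_1$. The captions do not state any normalization of the data, so none is applied here; the function names and the requirement that $n$ be even are assumptions of this illustration.

```python
import numpy as np


def sample_two_class_gmm(p, n, seed=0):
    """Sample n points in R^p: first n/2 from class 1, the rest from class 2.

    Follows the Figure 2 caption: mu_a has a single entry 8 at 0-based index
    8(a-1), and C_a = (1 + 8(a-1)/sqrt(p)) I_p. Assumes n is even.
    Returns X of shape (p, n) and integer labels in {1, 2}.
    """
    rng = np.random.default_rng(seed)
    half = n // 2
    blocks, labels = [], []
    for a in (1, 2):
        mean = np.zeros(p)
        mean[8 * (a - 1)] = 8.0
        # Standard deviation implied by the isotropic covariance C_a.
        scale = np.sqrt(1.0 + 8.0 * (a - 1) / np.sqrt(p))
        blocks.append(mean[:, None] + scale * rng.standard_normal((p, half)))
        labels.append(np.full(half, a))
    return np.hstack(blocks), np.concatenate(labels)


def binary_activation(t):
    """Heuristic quantized activation sigma(t) = 1_{t < -1} + 1_{t > 1}."""
    t = np.asarray(t)
    return ((t < -1) | (t > 1)).astype(float)


if __name__ == "__main__":
    X, y = sample_two_class_gmm(p=2000, n=8000)
    print(X.shape, np.bincount(y)[1:])
```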

Theorems & Definitions (12)

  • Theorem 1: Asymptotic spectral equivalents for CK matrices
  • Remark 1: On activation centering
  • Theorem 2: Asymptotic spectral equivalent for NTK matrices
  • Remark 2: On spectral norm characterization
  • Remark 3: On CK and NTK matrices
  • Corollary 1: Sparse and quantized DNNs
  • Remark 4: Beyond Gaussian mixture data
  • Remark 5
  • Lemma 1
  • Lemma 2: Consistent estimation of $\tau_0$
  • ...and 2 more