"Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach

Lingyu Gu; Yongqi Du; Yuan Zhang; Di Xie; Shiliang Pu; Robert C. Qiu; Zhenyu Liao

TL;DR

This paper demonstrates that in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists an asymptotic spectral equivalence between the NTK matrices for a large family of DNN models.

Abstract

Modern deep neural networks (DNNs) are extremely powerful; however, this comes at the price of increased depth and having more parameters per layer, making their training and inference more computationally challenging. In an attempt to address this key limitation, efforts have been devoted to the compression (e.g., sparsification and/or quantization) of these large-scale machine learning models, so that they can be deployed on low-power IoT devices. In this paper, building upon recent advances in neural tangent kernel (NTK) and random matrix theory (RMT), we provide a novel compression approach to wide and fully-connected \emph{deep} neural nets. Specifically, we demonstrate that in the high-dimensional regime where the number of data points $n$ and their dimension $p$ are both large, and under a Gaussian mixture model for the data, there exists \emph{asymptotic spectral equivalence} between the NTK matrices for a large family of DNN models. This theoretical result enables "lossless" compression of a given DNN to be performed, in the sense that the compressed network yields asymptotically the same NTK as the original (dense and unquantized) network, with its weights and activations taking values \emph{only} in $\{ 0, \pm 1 \}$ up to a scaling. Experiments on both synthetic and real-world data are conducted to support the advantages of the proposed compression scheme, with code available at \url{https://github.com/Model-Compression/Lossless_Compression}.
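As a rough numerical illustration of the kind of spectral closeness the abstract describes, the sketch below compares an empirical single-hidden-layer conjugate kernel built from dense Gaussian weights with one built from sparse ternary weights. It is a simplified sketch and not the paper's compression scheme: it uses one ReLU layer, leaves the activations unquantized, and looks at a CK-type matrix rather than the NTK. The helper names (`ternary_weights`, `empirical_ck`), the sizes, and the sparsity level of 0.5 are all assumptions chosen for illustration.

```python
import numpy as np

def ternary_weights(rng, shape, sparsity):
    """I.i.d. weights in {0, +s, -s} with zero mean and unit variance.

    Each entry is zero with probability `sparsity`; the scale
    s = 1 / sqrt(1 - sparsity) restores unit variance.
    """
    scale = 1.0 / np.sqrt(1.0 - sparsity)
    signs = rng.choice([-1.0, 1.0], size=shape)
    keep = rng.random(shape) >= sparsity  # kept with probability 1 - sparsity
    return scale * signs * keep

def empirical_ck(W, X):
    """Empirical one-layer kernel relu(WX)^T relu(WX) / width."""
    feats = np.maximum(W @ X, 0.0)
    return feats.T @ feats / W.shape[0]

rng = np.random.default_rng(0)
p, n, width = 200, 100, 4000  # illustrative sizes only
X = rng.standard_normal((p, n)) / np.sqrt(p)  # columns have norm close to 1

K_dense = empirical_ck(rng.standard_normal((width, p)), X)
K_ternary = empirical_ck(ternary_weights(rng, (width, p), sparsity=0.5), X)

# Spectral-norm gap between the two kernels, relative to the dense one.
rel_gap = np.linalg.norm(K_dense - K_ternary, 2) / np.linalg.norm(K_dense, 2)
print(f"relative spectral-norm gap: {rel_gap:.3f}")
```

In this toy setting the gap mostly reflects finite-width sampling noise, so it should shrink as `width` grows; the repository linked in the abstract is the place to look for the authors' actual implementation.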

Paper Structure (27 sections, 6 theorems, 121 equations, 4 figures, 2 tables, 1 algorithm)

Introduction
Our contributions
Related work
Notations and organization of the paper
Preliminaries
Main results
Numerical experiments
Conclusion and perspectives
Proofs of theorems and auxiliary results
Proof of Theorem 1
Setup and notations
Proof of Lemma \ref{lem:entry-wise-approx-CK-center}
On the diagonal.
Off the diagonal.
Proof of Theorem 2
...and 12 more sections

Key Result

Theorem 1

Let Assumptions \ref{ass:high-dimen}--\ref{ass:activation} hold, and let $\tau_0, \tau_1, \ldots, \tau_L \geq 0$ be a sequence of non-negative numbers satisfying the following recursion: […] Further assume that the activation functions $\sigma_\ell(\cdot)$ are "centered," such that $\mathbb{E}[\sigma_\ell(\tau_{\ell-1} \xi)] = 0$. Then, for the CK matrix $\mathbf{K}_{\mathrm{CK},\ell}$ of la
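The centering condition $\mathbb{E}[\sigma_\ell(\tau_{\ell-1} \xi)] = 0$ can be checked, or enforced by subtracting a constant, with a few lines of numerics. The sketch below assumes $\xi$ is standard Gaussian and uses Gauss-Hermite quadrature for the expectation; the helper names `gaussian_mean` and `center`, and the value `tau = 1.5`, are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_mean(f, tau, deg=100):
    """Approximate E[f(tau * xi)] for xi ~ N(0, 1) by Gauss-Hermite quadrature.

    hermegauss uses the weight exp(-x^2 / 2), so dividing by sqrt(2 * pi)
    turns the weighted sum into a Gaussian expectation.
    """
    nodes, weights = np.polynomial.hermite_e.hermegauss(deg)
    return float(np.sum(weights * f(tau * nodes)) / np.sqrt(2.0 * np.pi))

def center(f, tau, deg=100):
    """Return t -> f(t) - E[f(tau * xi)], which has zero Gaussian mean at scale tau."""
    shift = gaussian_mean(f, tau, deg)
    return lambda t: f(t) - shift

def relu(t):
    return np.maximum(t, 0.0)

tau = 1.5  # hypothetical scale, for illustration only
relu_centered = center(relu, tau)

print("E[relu(tau * xi)]          =", gaussian_mean(relu, tau))
print("E[relu_centered(tau * xi)] =", gaussian_mean(relu_centered, tau))
```

For ReLU the uncentered mean is $\tau/\sqrt{2\pi}$ in closed form, which gives a quick sanity check on the quadrature; the kink at zero limits its accuracy, so a small discrepancy is expected.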

Figures (4)

Figure 1: Visual representations of activations $\sigma_T$ and $\sigma_Q$ in \ref{eq:def_sigmal} (left) and the expressions of $\mathbb{E}[\sigma_T(\tau\xi)]$ and $\mathbb{E}[\sigma_Q(\tau\xi)]$ (right), with $r_1 - r_2 = r_3 - r_4$ here and $\mathrm{erf}(\cdot)$ the Gaussian error function.
Figure 2: Eigenvalue histograms (top) and dominant eigenvectors (bottom) of the last-layer CK matrices $\mathbf{K}_{\mathrm{CK}}$ (blue) defined in \ref{eq:def_K_CK_K_NTK} (with the expectation estimated from $1\,000$ independent realizations of the $\mathbf{W}$s) and of their asymptotic equivalents $\tilde{\mathbf{K}}_{\mathrm{CK}}$ (red). (Left) Gaussian $\mathbf{W}$ on two-class GMM data, with $p=2\,000$, $n=8\,000$, $\boldsymbol{\mu}_a=[\mathbf{0}_{8(a-1)};~8;~\mathbf{0}_{p-8a+7}]$, $\mathbf{C}_a=(1+8(a-1)/\sqrt{p})\mathbf{I}_p$, $a \in \{1,2\}$, using $[\mathrm{ReLU},~\mathrm{ReLU},~\mathrm{ReLU}]$ activations; here $\| \mathbf{K}_{\mathrm{CK}} - \tilde{\mathbf{K}}_{\mathrm{CK}} \| = 0.15$. (Right) Symmetric Bernoulli $\mathbf{W}$ on MNIST data (digit $6$ versus $8$) \cite{lecun1998gradient}, with $p=784$, $n=3\,200$, using $[\mathrm{poly},~\mathrm{ReLU},~\mathrm{ReLU}]$ activations; here $\| \mathbf{K}_{\mathrm{CK}} - \tilde{\mathbf{K}}_{\mathrm{CK}} \| = 6.86$. In both cases, $\mathbf{x}_1, \ldots, \mathbf{x}_{n/2} \in \mathcal{C}_1$ and $\mathbf{x}_{n/2+1}, \ldots, \mathbf{x}_n \in \mathcal{C}_2$.
Figure 3: Classification accuracies of different compressed fully-connected nets on the MNIST \cite{lecun1998gradient} (top) and CIFAR10 \cite{krizhevsky2009learning} (bottom) datasets. Blue curves represent the proposed compression approach with different levels of sparsity $\varepsilon \in \{ 0\%, 50\%, 90\% \}$, purple curves represent the heuristic sparsification approach by uniformly zeroing out $80\%$ of the weights, green curves represent the heuristic quantization approach using the binary activation $\sigma(t) = 1_{t < -1} + 1_{t > 1}$, red curves represent the original network, brown curves represent the proposed compression approach without activation quantization, with $\varepsilon=90\%$ for MNIST (top) and $\varepsilon=95\%$ for CIFAR10 (bottom), and orange curves represent magnitude-based pruning \cite{gale2019state} with the same sparsity level $\varepsilon$ as the brown curves. Memory varies with the layer width of the network.
Figure 4: Test accuracy of classification on the 2-class MNIST dataset (top; digits $6$ versus $8$) and the 5-class MNIST dataset (bottom; digits $(0,1,2,3,4)$). Blue curves represent the proposed "lossless" compression scheme with different levels of sparsity $\varepsilon \in \{ 0\%, 50\%, 90\% \}$, purple curves represent the heuristic sparsification approach by uniformly zeroing out $90\%$ of the weights, green curves represent the heuristic quantization approach using the binary activation $\sigma(t) = 1_{t < -1} + 1_{t > 1}$ (only applied on the first two layers, otherwise the performance is too poor to be compared to other curves), and red curves represent the original (dense and unquantized) network. All nets have three fully-connected layers, and the original network uses $\mathrm{ReLU}$ activations for all layers. Memory varies with the layer width of the network.
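The heuristic baselines named in the captions of Figures 3 and 4 are simple enough to sketch. The code below is one plausible reading of them, not the authors' implementation: `uniform_sparsify` zeroes an exact fraction of entries chosen uniformly at random, `magnitude_prune` zeroes the entries of smallest absolute value, and `binary_activation` implements $\sigma(t) = 1_{t < -1} + 1_{t > 1}$ as written in the captions. The function names and the per-matrix (rather than global) pruning are assumptions.

```python
import numpy as np

def binary_activation(t):
    """sigma(t) = 1_{t < -1} + 1_{t > 1}: equals 1 outside [-1, 1], 0 inside."""
    t = np.asarray(t, dtype=float)
    return ((t < -1.0) | (t > 1.0)).astype(float)

def uniform_sparsify(W, fraction, rng):
    """Zero out a `fraction` of the entries of W, chosen uniformly at random."""
    k = int(round(fraction * W.size))
    out = W.copy().ravel()
    out[rng.choice(W.size, size=k, replace=False)] = 0.0
    return out.reshape(W.shape)

def magnitude_prune(W, fraction):
    """Zero out the `fraction` of entries of W with smallest absolute value."""
    k = int(round(fraction * W.size))
    if k == 0:
        return W.copy()
    # k-th smallest magnitude; entries at or below it are pruned.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) > threshold, W, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))
W_uniform = uniform_sparsify(W, 0.8, rng)
W_pruned = magnitude_prune(W, 0.8)
print(np.count_nonzero(W_uniform), np.count_nonzero(W_pruned))
```

With continuous-valued weights there are no ties at the threshold, so both routines leave the same number of non-zero entries and differ only in which entries survive.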

Theorems & Definitions (12)

Theorem 1: Asymptotic spectral equivalents for CK matrices
Remark 1: On activation centering
Theorem 2: Asymptotic spectral equivalent for NTK matrices
Remark 2: On spectral norm characterization
Remark 3: On CK and NTK matrices
Corollary 1: Sparse and quantized DNNs
Remark 4: Beyond Gaussian mixture data
Remark 5
Lemma 1
Lemma 2: Consistent estimation of $\tau_0$
...and 2 more