Compressing Neural Networks Using Tensor Networks with Exponentially Fewer Variational Parameters

Yong Qing; Ke Li; Peng-Fei Zhou; Shi-Ju Ran

TL;DR

The paper addresses the very large number of variational parameters in neural networks by introducing automatically differentiable tensor networks (ADTN), which encode the parameters of a layer as the contraction of a deep tensor network. This reduces the parameter count from an exponential scale to a near-linear scale while preserving or improving accuracy on standard benchmarks. The method combines brick-wall ADTN constructions with a two-stage optimization (Euclidean pre-training followed by task-specific fine-tuning). On VGG-16, for example, two linear layers with approximately $10^7$ parameters are compressed to two ADTN's with just 424 parameters, and the CIFAR-10 testing accuracy rises from $90.17\%$ to $91.74\%$. Key insights include the importance of the compression order (backward versus forward), the benefit of deeper ADTN's at high compression, and the greater representational power of deep TN's compared with shallow tensor decompositions. The approach offers a practical, scalable framework for compressing neural networks and suggests deep tensor networks as an efficient structure for representing variational parameters in modern architectures.

Abstract

A neural network (NN) designed for challenging machine learning tasks is in general a highly nonlinear mapping that contains massive variational parameters. High complexity of an NN, if unbounded or unconstrained, might unpredictably cause severe issues including overfitting, loss of generalization power, and unbearable hardware cost. In this work, we propose a general compression scheme that significantly reduces the variational parameters of NN's, regardless of their specific types (linear, convolutional, etc.), by encoding them into a deep automatically differentiable tensor network (ADTN) that contains exponentially fewer free parameters. Superior compression performance of our scheme is demonstrated on several widely recognized NN's (FC-2, LeNet-5, AlexNet, ZFNet and VGG-16) and datasets (MNIST, CIFAR-10 and CIFAR-100). For instance, we compress two linear layers in VGG-16 with approximately $10^{7}$ parameters to two ADTN's with just 424 parameters, improving the testing accuracy on CIFAR-10 from $90.17\%$ to $91.74\%$. We argue that the deep structure of ADTN is an essential reason for its remarkable compression performance, compared to existing compression schemes that are mainly based on tensor decompositions/factorizations and shallow tensor networks. Our work suggests deep TN as an exceptionally efficient mathematical structure for representing the variational parameters of NN's, which exhibits superior compressibility over the commonly used matrices and multi-way arrays.
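To make the idea of encoding a weight matrix in a deep tensor network concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the index dimension of 2, the exact brick-wall placement of the tensors, and the names `brick_wall_pairs`, `contract_adtn` and `num_adtn_parameters` are all hypothetical, and the paper's ADTN geometry and parameter counts may differ. The sketch contracts $M$ brick-wall layers of small $(2 \times 2 \times 2 \times 2)$ tensors into a dense $2^Q \times 2^Q$ matrix and compares the number of free parameters.

```python
import numpy as np


def brick_wall_pairs(num_indices):
    """Neighbouring index pairs touched by one brick-wall layer.

    Assumed layout: tensors on pairs (0,1), (2,3), ... followed by
    tensors on pairs (1,2), (3,4), ...
    """
    even = [(q, q + 1) for q in range(0, num_indices - 1, 2)]
    odd = [(q, q + 1) for q in range(1, num_indices - 1, 2)]
    return even + odd


def num_adtn_parameters(num_indices, num_layers):
    """Free parameters of the assumed layout: 16 per (2,2,2,2) tensor."""
    return num_layers * len(brick_wall_pairs(num_indices)) * 16


def contract_adtn(tensors, num_indices, num_layers):
    """Contract the small tensors into a dense (2**Q, 2**Q) matrix.

    `tensors` is a sequence of arrays of shape (2, 2, 2, 2), ordered layer
    by layer; axes 0,1 are outgoing indices and axes 2,3 are incoming ones.
    """
    dim = 2 ** num_indices
    # Start from the identity, viewed as Q two-dimensional indices plus a
    # column index, and apply the small tensors one after another.
    result = np.eye(dim).reshape((2,) * num_indices + (dim,))
    k = 0
    for _ in range(num_layers):
        for q, q_next in brick_wall_pairs(num_indices):
            # Sum over the incoming indices of the small tensor.
            result = np.tensordot(tensors[k], result, axes=([2, 3], [q, q_next]))
            # tensordot puts the outgoing indices first; move them back.
            result = np.moveaxis(result, [0, 1], [q, q_next])
            k += 1
    return result.reshape(dim, dim)


# Hypothetical sizes chosen only for illustration.
Q, M = 10, 3
rng = np.random.default_rng(0)
small_tensors = [rng.normal(size=(2, 2, 2, 2))
                 for _ in range(num_adtn_parameters(Q, M) // 16)]
T = contract_adtn(small_tensors, Q, M)

dense_params = T.size                       # 2**Q * 2**Q entries
adtn_params = num_adtn_parameters(Q, M)     # grows linearly in Q and in M
```

In this assumed layout the dense matrix has $4^Q$ entries while the tensor network has $16\,M\,(Q-1)$ parameters, which is the kind of exponential-versus-linear gap the abstract refers to. Because the contraction is built only from differentiable operations, the same construction could be written in an automatic-differentiation framework so that the small tensors, rather than the dense matrix, are the trainable variables.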

Paper Structure (15 sections, 9 equations, 7 figures, 10 tables)


Introduction
Methods
Encoding variational parameters of NN's to ADTN's
Optimization of ADTN
Results and Discussions
Compression Performance
Discussions on compression with under/overfitting
Discussions on local-minima problem and compression order
Discussions on Compression Faithfulness
Discussions on Tensor Network: "deep" versus "shallow"
Conclusion
Necessity of pre-training
Hyperparameters of ADTN's for compressing VGG-16
Basics of tensor network and the contractions
Details of FC-2, LeNet-5, AlexNet, ZFNet and VGG-16

Figures (7)

Figure 1: (Color online) The workflow of ADTN for compressing NN. (a) Illustration of a convolutional NN as an example, whose variational parameters ($\boldsymbol{T}$) are encoded in an ADTN shown in (b). In other words, contracting the ADTN yields $\boldsymbol{T}$, while the ADTN contains far fewer parameters than $\boldsymbol{T}$.
Figure 2: (Color online) The performance of the ADTN scheme with different numbers of tensor layers ($M$) when compressing both the linear and convolutional layers of a reduced version of VGG-16 (with the baseline testing accuracy $\eta_{\text{NN}}=81.14\%$). In the left panel, the red bars indicate the compressed layers, with lengths that illustrate the numbers of parameters, while the green bars represent the uncompressed layers. In the table on the right, the third and fourth columns show the training and testing accuracies ($\tilde{\eta}$ and $\eta$, respectively). The last column gives the total compression ratio $\rho_{\text{tot}}$ [Eq. (\ref{eq-rhot})].
Figure 3: (Color online) The after-compression testing accuracy $\eta$ on CIFAR-10 versus the inverse compression ratio $1/\rho$, obtained by compressing a simplified VGG-16 using ADTN and MPO (shallow TN). The baseline testing accuracy is $\eta_{\text{NN}} = 0.8114$.
Figure 4: (Color online) (a) The testing-accuracy ratio $\eta/\eta_{\text{NN}}$ versus the inverse total compression ratio $\rho_{\text{tot}}^{-1}$ for LeNet-5 on the CIFAR-10 dataset. The dimensions of the first two linear layers are taken as $(512 \times s)$ and $(s \times 128)$, where $s$ is taken as $32, 64, \cdots, 1536$ (see the color bar). $N$ ADTN's are used for compression, each containing $M=3$ TN layers. (b) The testing-accuracy ratio $\eta/\eta_{\text{NN}}$ versus the inverse compression ratio $\rho^{-1}$ for VGG-16 on the CIFAR-10 dataset. We select the $12$ largest layers (ten convolutional and two linear layers) in VGG-16 and compress them in two different orders: from the input to the output (denoted as forward compression) and the other way around (denoted as backward compression).
Figure 5: (Color online) The testing accuracy of NN $\eta_{\text{NN}}$ and that after compression $\eta$ of VGG-16 on the CIFAR-10 dataset with different numbers of training samples. The compression by our ADTN scheme faithfully restores the testing accuracy of the original NN with slight improvements.
...and 2 more figures
