Tensor network compressibility of convolutional models

Sukhbinder Singh, Saeed S. Jahromi, Roman Orus

TL;DR

This work investigates why tensorized CNNs retain accuracy by examining how truncating dense convolution kernels affects performance in vanilla CNNs and ResNet-50 trained on CIFAR-10/100. By applying single- and two-mode SVD-based truncations and CP-based truncations, and by analyzing kernel spectra and entanglement-like correlations, the authors show that many truncations incur large kernel-norm loss with only modest accuracy degradation, especially in deeper layers. They further demonstrate that aggressively truncated models can rapidly recover pre-truncation accuracy after a few epochs of retraining, implying that such truncations do not drive the model to poor minima. These findings support the view that correlation compression is intrinsic to how information is encoded in dense CNNs and bolster practical approaches for tensorizing and compressing convolutional networks.

Abstract

Convolutional neural networks (CNNs) are one of the most widely used neural network architectures, showcasing state-of-the-art performance in computer vision tasks. Although larger CNNs generally exhibit higher accuracy, their size can be effectively reduced by "tensorization" while maintaining accuracy, namely, replacing the convolution kernels with compact decompositions such as Tucker, Canonical Polyadic decompositions, or quantum-inspired decompositions such as matrix product states, and directly training the factors in the decompositions to bias the learning towards low-rank decompositions. But why doesn't tensorization seem to impact the accuracy adversely? We explore this by assessing how truncating the convolution kernels of dense (untensorized) CNNs impacts their accuracy. Specifically, we truncated the kernels of (i) a vanilla four-layer CNN and (ii) ResNet-50 pre-trained for image classification on the CIFAR-10 and CIFAR-100 datasets. We found that kernels (especially those inside deeper layers) could often be truncated along several cuts, resulting in a significant loss in kernel norm but not in classification accuracy. This suggests that such "correlation compression" (underlying tensorization) is an intrinsic feature of how information is encoded in dense CNNs. We also found that aggressively truncated models could often recover the pre-truncation accuracy after only a few epochs of re-training, suggesting that compressing the internal correlations of convolution layers does not often transport the model to a worse minimum. Our results can be applied to tensorize and compress CNN models more effectively.
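
To make "truncating a kernel along a cut" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the kernel layout `(out, in, kh, kw)`, the helper name `truncate_kernel`, and the definition of norm loss as the relative Frobenius-norm error are all choices made for this example.

```python
import numpy as np

def truncate_kernel(kernel, row_modes, rank):
    """Keep only the `rank` largest singular values across one bipartition.

    `row_modes` lists the kernel modes grouped on one side of the cut;
    all remaining modes form the other side. (Hypothetical helper.)
    """
    col_modes = [m for m in range(kernel.ndim) if m not in row_modes]
    perm = list(row_modes) + col_modes

    # Bring the row modes to the front and flatten the kernel into a matrix.
    permuted = np.transpose(kernel, perm)
    row_dim = int(np.prod([kernel.shape[m] for m in row_modes]))
    matrix = permuted.reshape(row_dim, -1)

    # SVD across the cut, then zero all but the leading `rank` singular values.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_kept = s.copy()
    s_kept[rank:] = 0.0
    truncated = (u * s_kept) @ vt

    # Undo the flattening and the mode permutation.
    truncated = truncated.reshape(permuted.shape)
    truncated = np.transpose(truncated, np.argsort(perm))

    # Assumed definition: relative Frobenius-norm error of the truncation.
    norm_loss = np.linalg.norm(kernel - truncated) / np.linalg.norm(kernel)
    return truncated, s, norm_loss

# Example: a random kernel with 8 output channels, 4 input channels, 3x3 filters,
# truncated across the cut that groups the output and kernel-height modes.
rng = np.random.default_rng(0)
kernel = rng.standard_normal((8, 4, 3, 3))
truncated, spectrum, loss = truncate_kernel(kernel, row_modes=(0, 2), rank=5)
```

A single-mode cut corresponds to `row_modes` containing one index, for example `(0,)` for the output mode. In an experiment of the kind the abstract describes, the truncated kernel would replace the original weights of a trained layer before re-evaluating accuracy; a random kernel is used here only so the sketch is self-contained.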

Paper Structure (30 sections, 27 equations, 21 figures, 2 tables)

Figures (21)

  • Figure 1: (i) The schematic of a CNN architecture, composed of a feature extractor, a classifier, and a Softmax non-linearity that converts the classifier's output into probabilities. (ii) The basic block of operations repeated ($N$ times) to compose the feature extractor. (iii) The basic block of operations repeated ($M$ times) to compose the classifier. While ReLU is a common choice of non-linearity in CNNs, other non-linear functions such as hyperbolic tangent and sigmoid are also used.
  • Figure 2: (i) The input image as a 3-index tensor $I$. (ii) The "patch image" tensor $I^\square$ obtained from the input image either by reshaping indices (for simple convolutions) or by applying a more general shuffling transformation called im2col() [Im2col, Im2colMatLab]. (iii) The convolution kernel $K$ as a 4-index tensor. (iv) Convolution on an image as a contraction of the patch image tensor $I^\square$ with the convolution kernel tensor $K$, producing an output feature image $I'$ [ConvTN]. A non-linearity such as ReLU (not shown here) is applied on $I'$ to obtain a transformed image. (v) The resulting feature image $I'$ is reorganized into a patch image $I'{}^\square$. Then average pooling can be understood as a tensor contraction of $I'{}^\square$ with a vector whose components are all ones and another vector whose components are all $\alpha$ [see Fig. \ref{fig:tensors}(v)].
  • Figure 3: Popular tensor network decompositions of the convolution kernel. (i) Tucker decomposition. (ii) CP decomposition. Matrix Product State-based decompositions: (iii) Tensor Train (MPS with open boundary condition) and (iv) Tensor Ring (MPS with periodic boundary condition). Examples of structured convolutions (adapted from [StructuredConvolutions]): (v) a convolution kernel comprised of rank-1 filters (namely, a spatially separable convolution), and (vi) a depthwise separable convolution kernel underlying successful CNNs such as Xception [Xception] and MobileNet [MobileNet].
  • Figure 4: (i) The HOSVD decomposition of the convolution kernel. $S^X, S^Y, S^{\hbox{\tiny in}}$ and $S^{\hbox{\tiny out}}$ are diagonal matrices whose diagonal entries are the singular values. We also refer to these as the single-mode singular values of the kernel, corresponding to the four modes (indices) of the kernel. (ii) The mode matrices $U^X, U^Y, U^{\hbox{\tiny in}}$, and $U^{\hbox{\tiny out}}$ are orthogonal, fulfilling $U^X (U^X)^T = I_{|\alpha|}$, and so on. Here, $(\cdot)^T$ denotes matrix transposition. The size of the identity matrix depicted on the right is equal to the HOSVD rank of that mode. (iii) The part of the HOSVD obtained by discarding any orthogonal mode matrix is an isometry; namely, it fulfills an identity similar to the one shown in this panel for mode $x$.
  • Figure 5: All the bipartitions of the kernel modes (indices) considered in this paper for truncation. Single-mode bipartitions labeled (i) KW, (ii) KH, (iii) OUT, and (iv) IN. Two-mode bipartitions labeled (v) (OUT, IN), (vi) (OUT, KW), and (vii) (OUT, KH). For each picture, the kernel can be transformed into a corresponding matrix by bending, crossing, and grouping indices, as shown.
  • ...and 16 more figures
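
The picture in Figure 2, where a convolution is a contraction of the patch image tensor $I^\square$ with the kernel $K$, can be sketched in NumPy as follows. This is a minimal illustration under assumptions made for the example: stride 1, no padding, image layout `(height, width, channels)`, kernel layout `(kh, kw, in, out)`, and the function names `conv_as_contraction` and `conv_direct`, none of which come from the paper. As is usual for CNNs, the operation computed is a cross-correlation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_as_contraction(image, kernel):
    """Convolution written as a single tensor contraction (hypothetical helper)."""
    kh, kw, _, _ = kernel.shape
    # Patch image tensor: one (kh, kw) window per output pixel and input channel.
    # Resulting shape: (out_height, out_width, channels, kh, kw).
    patches = sliding_window_view(image, (kh, kw), axis=(0, 1))
    # Contract the window and channel indices of the patches with the kernel.
    return np.einsum("xycij,ijco->xyo", patches, kernel)

def conv_direct(image, kernel):
    """Reference implementation with explicit loops over output pixels."""
    kh, kw, _, c_out = kernel.shape
    height, width, _ = image.shape
    out = np.zeros((height - kh + 1, width - kw + 1, c_out))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            patch = image[x:x + kh, y:y + kw, :]  # shape (kh, kw, channels)
            out[x, y] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

# Example: a random 6x6 image with 3 channels and a 3x3 kernel with 2 output channels.
rng = np.random.default_rng(1)
image = rng.standard_normal((6, 6, 3))
kernel = rng.standard_normal((3, 3, 3, 2))
feature_contracted = conv_as_contraction(image, kernel)
feature_direct = conv_direct(image, kernel)
```

Both functions produce the same feature image. Writing the layer as a contraction is what lets the kernel be treated as a 4-index tensor that can be decomposed or truncated independently of the image.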