Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications

Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, Dongjun Shin

TL;DR

This work addresses the challenge of deploying deep CNNs on resource-constrained mobile devices by proposing a one-shot whole-network compression that combines global VBMF-based rank selection with Tucker decomposition of kernel tensors and a final fine-tuning stage. The method compresses entire networks (e.g., AlexNet, VGG-S, GoogLeNet, VGG-16), achieving major reductions in parameter count and FLOPs while maintaining competitive accuracy after fine-tuning, and providing notable energy and runtime gains on mobile hardware. Key insights include the effective use of Tucker-2 for mid/high-complexity kernels, automatic per-layer rank determination via VBMF, and practical considerations of $1\times1$ convolutions affecting cache efficiency and energy. Overall, the approach enables faster, energy-efficient mobile inference with publicly available tools and techniques, illustrating a viable path toward on-device deployment of deep CNNs. This has significant implications for real-time mobile vision tasks and energy-constrained AI applications.

Abstract

Although the latest high-end smartphones have powerful CPUs and GPUs, running deeper convolutional neural networks (CNNs) for complex tasks such as ImageNet classification on mobile devices is challenging. To deploy deep CNNs on mobile devices, we present a simple and effective scheme to compress the entire CNN, which we call one-shot whole network compression. The proposed scheme consists of three steps: (1) rank selection with variational Bayesian matrix factorization, (2) Tucker decomposition on the kernel tensor, and (3) fine-tuning to recover the accumulated loss of accuracy. Each step can be easily implemented using publicly available tools. We demonstrate the effectiveness of the proposed scheme by testing the performance of various compressed CNNs (AlexNet, VGG-S, GoogLeNet, and VGG-16) on a smartphone. Significant reductions in model size, runtime, and energy consumption are obtained, at the cost of a small loss in accuracy. In addition, we address an important implementation-level issue with $1\times 1$ convolution, which is a key operation of the inception module of GoogLeNet as well as of CNNs compressed by our proposed scheme.
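As a rough illustration of step (2), the sketch below applies a Tucker-2 decomposition to a single kernel tensor with NumPy. It rests on several assumptions that are not from the paper: the kernel is stored as `(T, S, D, D)` (output channels, input channels, spatial size), the factors are obtained by a plain truncated higher-order SVD rather than by whichever Tucker solver the authors used, and the ranks `rank_out` and `rank_in` are hand-picked hypothetical values standing in for the VBMF-selected ranks of step (1). The helper names `unfold`, `tucker2`, and `reconstruct` are illustrative only.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` matricization: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tucker2(kernel, rank_out, rank_in):
    """Truncated HOSVD over the two channel modes of a (T, S, D, D) kernel.

    Assumption: ranks are given by the caller (in the paper they come from VBMF).
    """
    # Leading left singular vectors of the output-channel and input-channel unfoldings.
    u_out = np.linalg.svd(unfold(kernel, 0), full_matrices=False)[0][:, :rank_out]
    u_in = np.linalg.svd(unfold(kernel, 1), full_matrices=False)[0][:, :rank_in]
    # Project the kernel onto both subspaces to obtain the (rank_out, rank_in, D, D) core.
    core = np.einsum("tsij,ta,sb->abij", kernel, u_out, u_in)
    return core, u_out, u_in

def reconstruct(core, u_out, u_in):
    """Multiply the core back by both factor matrices."""
    return np.einsum("abij,ta,sb->tsij", core, u_out, u_in)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((16, 8, 3, 3))  # hypothetical layer: T=16, S=8, D=3

core, u_out, u_in = tucker2(kernel, rank_out=6, rank_in=4)
approx = reconstruct(core, u_out, u_in)

rel_err = np.linalg.norm(kernel - approx) / np.linalg.norm(kernel)
params_before = kernel.size
params_after = core.size + u_out.size + u_in.size
print(f"relative error: {rel_err:.3f}")
print(f"parameters: {params_before} -> {params_after}")
```

A random kernel has no low-rank structure, so the reconstruction error here is large; trained kernels are typically far more compressible, and the fine-tuning of step (3) is what recovers the remaining accuracy loss.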

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Mode-1 (top left), mode-2 (top right), and mode-3 (bottom left) matricization of the 3-way tensor. They are constructed by concatenation of frontal, horizontal, and vertical slices, respectively. (Bottom right): Illustration of 3-way Tucker decomposition. The original tensor ${\mathcal{X}}$ of size $I_1 \times I_2 \times I_3$ is decomposed to the product of the core tensor ${\mathcal{S}}$ of size $J_1 \times J_2 \times J_3$ and factor matrices $\boldsymbol{A}^{(1)}$, $\boldsymbol{A}^{(2)}$, and $\boldsymbol{A}^{(3)}$.
  • Figure 2: Our one-shot whole network compression scheme consists of (1) rank selection with VBMF; (2) Tucker decomposition on the kernel tensor; (3) fine-tuning of the entire network. Note that Tucker-2 decomposition is applied from the second convolutional layer to the first fully connected layer, and Tucker-1 decomposition to the other layers.
  • Figure 3: Tucker-2 decompositions for speeding up a convolution. Each transparent box corresponds to a 3-way tensor ${\mathcal{X}}$, ${\mathcal{Z}}$, ${\mathcal{Z}}'$, or ${\mathcal{Y}}$ in (\ref{eq:conva})-(\ref{eq:convc}), with two frontal sides corresponding to spatial dimensions. Arrows represent linear mappings and illustrate how scalar values on the right are computed. The yellow tube, red box, and blue tube correspond to the $1\times 1$, $D \times D$, and $1\times 1$ convolutions in (\ref{eq:conva}), (\ref{eq:convb}), and (\ref{eq:convc}), respectively.
  • Figure 4: Accuracy of compressed CNNs in fine-tuning.
  • Figure 5: Power consumption over time for each model. (Blue: GPU, Red: main memory).
  • ...and 3 more figures
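
The three-stage structure described in the Figure 3 caption ($1\times 1$, then $D \times D$, then $1\times 1$ convolution) can be checked numerically. The following sketch is not the authors' implementation: it assumes a `(T, S, D, D)` kernel layout, stride 1 with no padding, and hypothetical sizes and random factors (`u_in`, `core`, `u_out`). It verifies that running the three small convolutions in sequence gives the same output as one convolution with the kernel rebuilt from the Tucker-2 factors.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid cross-correlation, stride 1: x is (S, H, W), kernel is (T, S, D, D)."""
    t, _, d, _ = kernel.shape
    h_out, w_out = x.shape[1] - d + 1, x.shape[2] - d + 1
    y = np.zeros((t, h_out, w_out))
    for i in range(d):
        for j in range(d):
            # Shifted input window contracted over input channels.
            patch = x[:, i:i + h_out, j:j + w_out]
            y += np.einsum("ts,shw->thw", kernel[:, :, i, j], patch)
    return y

rng = np.random.default_rng(0)
S, T, R_IN, R_OUT, D = 8, 16, 4, 6, 3        # hypothetical channel counts and ranks
u_in = rng.standard_normal((S, R_IN))         # input-channel factor matrix
core = rng.standard_normal((R_OUT, R_IN, D, D))
u_out = rng.standard_normal((T, R_OUT))       # output-channel factor matrix
x = rng.standard_normal((S, 10, 10))          # input feature map

# Stage 1: 1x1 convolution reduces S input channels to R_IN.
z = np.einsum("sr,shw->rhw", u_in, x)
# Stage 2: DxD convolution with the small core tensor, R_IN -> R_OUT channels.
z_prime = conv2d(z, core)
# Stage 3: 1x1 convolution expands R_OUT channels to T output channels.
y_fast = np.einsum("tr,rhw->thw", u_out, z_prime)

# Reference: one convolution with the full kernel rebuilt from the factors.
full_kernel = np.einsum("abij,ta,sb->tsij", core, u_out, u_in)
y_ref = conv2d(x, full_kernel)

# Multiplications per output position for the full and factored forms.
mults_full = T * S * D * D
mults_fast = S * R_IN + R_IN * R_OUT * D * D + R_OUT * T
print(np.allclose(y_fast, y_ref), mults_full, mults_fast)
```

The per-position multiplication counts ignore the slightly larger spatial extent of the first $1\times 1$ stage, so they are an approximation of the real cost ratio rather than an exact FLOP count.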