DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks

Yonggan Fu; Haichuan Yang; Jiayi Yuan; Meng Li; Cheng Wan; Raghuraman Krishnamoorthi; Vikas Chandra; Yingyan Celine Lin

TL;DR

DepthShrinker tackles the gap between theoretical compactness and real-hardware efficiency in DNNs by removing redundant activations and merging consecutive linear layers into dense convolutions, significantly improving hardware utilization with minimal accuracy loss. The framework employs a differentiable activation-pruning search, stage-wise fine-tuning with optional free activations and self-distillation, and a merger mechanism that converts blocks into single dense operators with kernel size $d = d_1 + d_2 - 1$. Its DepthShrinker family and DepthShrinker$^+$ training strategy demonstrate superior accuracy-throughput trade-offs against SOTA pruning and efficient DNNs across multiple devices and models, including MobileNetV2 and EfficientNet-Lite variants. This work offers a practical, hardware-aware compression paradigm that leverages dense computation's parallelism to unlock real-world efficiency gains, with potential extensions to NAS and broader hardware targets.
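The kernel-size relation $d = d_1 + d_2 - 1$ can be checked with a minimal sketch, assuming a single-channel, stride-1, bias-free 1D setting with no activation between the two layers. The helper names `conv1d_full` and `merge_kernels` are hypothetical and are not taken from the DepthShrinker codebase; the paper's merge operates on multi-channel pointwise and depthwise convolutions, which this toy example does not reproduce. It only illustrates why two back-to-back linear convolutions collapse into one.

```python
def conv1d_full(signal, kernel):
    """Full 1D convolution of two lists of numbers (pure Python)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def merge_kernels(k1, k2):
    """Hypothetical helper: fold two consecutive linear kernels into one.

    Valid only when nothing nonlinear sits between the two layers,
    because convolution is associative: (x * k1) * k2 == x * (k1 * k2).
    """
    return conv1d_full(k1, k2)


x = [1.0, 2.0, -1.0, 0.5, 3.0]
k1 = [1.0, 0.0, -1.0]  # kernel size d1 = 3
k2 = [0.5, 0.5]        # kernel size d2 = 2

# Two layers applied one after the other.
two_step = conv1d_full(conv1d_full(x, k1), k2)

# One merged layer with kernel size d1 + d2 - 1 = 4.
merged = merge_kernels(k1, k2)
one_step = conv1d_full(x, merged)

assert len(merged) == len(k1) + len(k2) - 1
assert all(abs(a - b) < 1e-9 for a, b in zip(two_step, one_step))
```

Deep learning frameworks usually implement cross-correlation rather than flipped-kernel convolution, but the merged kernel size follows the same relation. If an activation function remained between the two layers, the associativity argument would no longer hold, which is why the activations have to be removed before merging.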

Abstract

Efficient deep neural network (DNN) models equipped with compact operators (e.g., depthwise convolutions) have shown great potential in reducing DNNs' theoretical complexity (e.g., the total number of weights/operations) while maintaining a decent model accuracy. However, existing efficient DNNs are still limited in fulfilling their promise of boosting real-hardware efficiency, because their commonly adopted compact operators have low hardware utilization. In this work, we open up a new compression paradigm for developing real-hardware efficient DNNs, leading to boosted hardware efficiency while maintaining model accuracy. Interestingly, we observe that while some DNN layers' activation functions help DNNs' training optimization and achievable accuracy, they can be properly removed after training without compromising the model accuracy. Inspired by this observation, we propose a framework dubbed DepthShrinker, which develops hardware-friendly compact networks by shrinking the basic building blocks of existing efficient DNNs, which feature irregular computation patterns, into dense ones with much improved hardware utilization and thus real-hardware efficiency. Excitingly, our DepthShrinker framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques, e.g., 3.06% higher accuracy and 1.53$\times$ throughput on Tesla V100 over the SOTA channel-wise pruning method MetaPruning. Our code is available at: https://github.com/facebookresearch/DepthShrinker.

Paper Structure (23 sections, 5 equations, 7 figures, 9 tables)

Introduction
Related Works
Motivating Inspiration and Observations
Inspiration Drawn from Previous Works
Motivating Observations from Real-device Profiling
The Proposed DepthShrinker Framework
Overview
Stage 1: Identify Redundant Activation Functions
Stage 2: How to Fine-tune
Stage 3: How to Merge
DepthShrinker$^+$: Expand-then-Shrink
Experiment Results
Experiment Setup
Benchmark with SOTA Pruning Methods
Benchmark with SOTA Efficient DNNs
...and 8 more sections

Figures (7)

Figure 1: Visualizing the block-wise latency of the inverted residual blocks (a total of 17) in MobileNetV2/MobileNetV2-1.4 (solid lines) and their corresponding dense convolutions (dashed lines) on an RTX 2080Ti GPU. "MBV2" denotes MobileNetV2.
Figure 2: Overview of our DepthShrinker framework and its three-stage design. "PW" and "DW" denote pointwise/depthwise convolutions, respectively. During merging, we merge the two pointwise convolutions and one depthwise convolution in blocks whose activation functions are removed, into one dense convolution.
Figure 3: Benchmarking DepthShrinker (solid line) against layer-wise pruning (dashed line) on top of three models in terms of FPS measured on an RTX 2080Ti GPU. "MBV2" and "Efflite0" denote MobileNetV2 and EfficientNet-Lite0, respectively.
Figure 4: Visualizing the block-wise latency of the blocks in MobileNetV2-1.4 (solid lines) and their merged counterparts (dashed lines) on an RTX 2080Ti GPU. We also annotate the blocks whose activation functions are retained, using different symbols for the three model variants delivered by DepthShrinker.
Figure 5: Visualizing the memory footprint, including both that of weights and peak activation maps, of MobileNetV2-1.4 before and after applying DepthShrinker.
...and 2 more figures
