LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging

Jinuk Kim, Marwa El Halabi, Mingi Ji, Hyun Oh Song

TL;DR

LayerMerge addresses the latency penalties of depth compression by jointly pruning activation and convolution layers and solving a surrogate optimization with dynamic programming. It introduces per-block importance and latency measures to select which layers to remove under a latency budget, enabling exact DP solutions with complexity on the order of $O(L^2 P K_0)$. Empirically, LayerMerge consistently outperforms existing depth compression and layer-pruning baselines on CNNs (ResNet-34, MobileNetV2) and diffusion-model generators (DDPM), with notable speed-ups and minimal accuracy loss, and it can complement channel-pruning approaches. The work provides a practical, hardware-aware framework for depth reduction with broad applicability and released code for reproducibility.

Abstract

Recent works show that reducing the number of layers in a convolutional neural network can enhance efficiency while maintaining the performance of the network. Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. However, these methods suffer from a critical drawback; the kernel size of the merged layers becomes larger, significantly undermining the latency reduction gained from reducing the depth of the network. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. To this end, we propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove, to achieve a desired inference speed-up while minimizing performance loss. Since the corresponding selection problem involves an exponential search space, we formulate a novel surrogate optimization problem and efficiently solve it via dynamic programming. Empirical results demonstrate that our method consistently outperforms existing depth compression and layer pruning methods on various network architectures, both on image classification and generation tasks. We release the code at https://github.com/snu-mllab/LayerMerge.
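
The kernel growth described in the abstract can be checked on a toy case. The sketch below is an illustration under simplifying assumptions (single channel, 1-D signals, stride 1, no padding, no activation between the two layers) and is not taken from the released code; the helper names `correlate_valid`, `merge_kernels`, and `merged_kernel_size` are hypothetical. It shows that two stacked layers with kernel sizes $k_1$ and $k_2$ collapse into one layer whose kernel has size $k_1 + k_2 - 1$.

```python
def correlate_valid(x, w):
    """Apply a CNN-style 'convolution' (cross-correlation), no padding, stride 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def merge_kernels(a, b):
    """Kernel of the single layer equivalent to applying `a` and then `b`.

    With no activation in between, the composition is linear, and the merged
    kernel is the full convolution (polynomial product) of the two kernels.
    """
    merged = [0.0] * (len(a) + len(b) - 1)
    for j, a_j in enumerate(a):
        for m, b_m in enumerate(b):
            merged[j + m] += a_j * b_m
    return merged


def merged_kernel_size(kernel_sizes):
    """Kernel size after merging consecutive stride-1 layers: sum(k) - (n - 1)."""
    return sum(kernel_sizes) - (len(kernel_sizes) - 1)


# Toy input and two kernels of size 3 (values chosen arbitrarily).
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0, 1.5]
a = [1.0, 0.0, -1.0]
b = [0.5, 2.0, 0.5]

two_layers = correlate_valid(correlate_valid(x, a), b)  # original: two layers
one_layer = correlate_valid(x, merge_kernels(a, b))     # merged: one layer

print(len(merge_kernels(a, b)))       # size of the merged kernel
print(merged_kernel_size([3, 3, 3]))  # three 3-wide layers merged into one
```

In two dimensions the same rule applies per spatial axis, so the cost of the merged kernel grows roughly quadratically with the number of merged layers, which is the latency penalty the abstract refers to.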

Paper Structure

This paper contains 42 sections, 2 theorems, 13 equations, 7 figures, 11 tables, 2 algorithms.

Key Result

Theorem 3.1

The solution $A^*$ and $(k_i^*)_{i=1}^{|A^*|+1}$ returned by the dynamic programming algorithm (alg:dp) is an optimal solution of Problem (eq:dp_master).
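
To make the shape of such a dynamic program concrete, here is a minimal sketch of a budgeted segment-selection DP. It is a toy under stated assumptions, not the paper's algorithm: the function name `solve_budgeted_merge`, the table layout keyed by `(i, j, k)` (merge layers `i+1..j` into one layer with kernel size `k`), the integer latency ticks, the "at most the budget" constraint, and all numeric values are hypothetical. The loops run over end index, budget level, start index, and kernel option, which gives a cost of the same $O(L^2 P K_0)$ form quoted in the TL;DR.

```python
NEG_INF = float("-inf")


def solve_budgeted_merge(num_layers, budget, importance, latency):
    """Maximize total importance of a segmentation of layers 1..num_layers.

    importance, latency: dicts keyed by (i, j, k), with 0 <= i < j <= num_layers.
    latency values are non-negative integers (discretized ticks).
    Returns (best_value, [(i, j, k), ...]) or None if no segmentation fits.
    """
    options = {}
    for (i, j, k) in importance:
        options.setdefault((i, j), []).append(k)

    # best[j][t]: best importance covering layers 1..j with latency at most t.
    best = [[NEG_INF] * (budget + 1) for _ in range(num_layers + 1)]
    choice = [[None] * (budget + 1) for _ in range(num_layers + 1)]
    best[0] = [0.0] * (budget + 1)

    for j in range(1, num_layers + 1):
        for t in range(budget + 1):
            for i in range(j):
                for k in options.get((i, j), []):
                    cost = latency[(i, j, k)]
                    if cost > t or best[i][t - cost] == NEG_INF:
                        continue
                    value = best[i][t - cost] + importance[(i, j, k)]
                    if value > best[j][t]:
                        best[j][t] = value
                        choice[j][t] = (i, k, cost)

    if best[num_layers][budget] == NEG_INF:
        return None

    # Backtrack from the final state to recover the chosen segments.
    segments, j, t = [], num_layers, budget
    while j > 0:
        i, k, cost = choice[j][t]
        segments.append((i, j, k))
        j, t = i, t - cost
    segments.reverse()
    return best[num_layers][budget], segments


# Hypothetical tables for a 3-layer network (values made up for illustration).
importance = {(0, 1, 3): 1.0, (1, 2, 3): 1.0, (2, 3, 3): 1.0,
              (0, 2, 5): 1.5, (1, 3, 5): 1.6, (0, 3, 7): 1.8, (0, 3, 3): 1.2}
latency = {(0, 1, 3): 2, (1, 2, 3): 2, (2, 3, 3): 2,
           (0, 2, 5): 3, (1, 3, 5): 3, (0, 3, 7): 5, (0, 3, 3): 2}

result = solve_budgeted_merge(3, 5, importance, latency)
print(result)
```

In this toy instance, keeping all three layers separate costs 6 ticks and exceeds the budget of 5, so the DP keeps the first layer and merges the last two. The entry `(0, 3, 3)` stands for a segment in which some convolution layers are pruned so the merged kernel stays small, which is the extra degree of freedom that joint pruning adds.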

Figures (7)

  • Figure 1: An illustration of the increase in kernel size significantly undermining the latency reduction in the depth compression framework. Here, $\theta_l$ denotes the $l$-th convolution parameter, $X^{(l)}$ denotes the $l$-th feature map, and $\mathrm{Ker}(\cdot)$ denotes the kernel size of the parameter. As the layers are merged, the kernel size of the merged layer continues to grow, impeding the latency reduction. The latency is measured for the depicted model, on RTX2080 Ti, with channel size 256, input resolution $56 \times 56$, and batch size 128.
  • Figure 2: A qualitative example comparing our method to the depth compression baseline of Kim et al. (2023), applied to the MobileNetV2-1.4 model on the ImageNet dataset. Existing depth compression methods have limitations in reducing latency due to the inevitable increase in the kernel size of the merged layer. Our method effectively bypasses this challenge by jointly optimizing the selection of the convolution layers and the non-linear activation layers.
  • Figure 3: Test accuracy recovery curve of different compression methods across fine-tuning epochs. We indicate the associated speed-up and accuracy after fine-tuning in the parentheses. The inference time is measured on RTX2080 Ti GPU at batch size 128 in PyTorch format.
  • Figure 4: Test accuracy recovery curve of our method compared to knowledge distillation across fine-tuning epochs for MobileNetV2-1.0. We indicate the associated speed-up and the accuracy after fine-tuning in the parentheses. The inference time is measured on RTX2080 Ti GPU at batch size 128 in PyTorch format.
  • Figure 5: Pareto curve of each compression method applied to each network. The latency speed-up is measured on RTX2080 Ti GPU in PyTorch format, with batch size of 128 for ImageNet dataset and batch size of 128 for CIFAR10 dataset.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 2.1
  • proof