Till the Layers Collapse: Compressing a Deep Neural Network through the Lenses of Batch Normalization Layers

Zhu Liao, Nour Hezbri, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione

TL;DR

The paper tackles the inefficiency of overparameterized deep networks by proposing Till the Layers Collapse (TLC), a method driven by batch normalization (BN) that removes entire layers to reduce depth and latency. TLC uses the BN parameters to infer an ON/OFF state for individual neurons and to decide which layers can be removed without substantial accuracy loss. It ranks layers by their impact on performance and removes the least important ones, optionally linearizing the remaining ON neurons. In this way TLC achieves depth compression with competitive or improved accuracy across image classification and NLP tasks, often outperforming BN-based pruning baselines. The approach promises more sustainable AI by reducing compute and energy demands, and it is validated on architectures including ResNet-18, Swin-T, MobileNet-V2, VGG-16bn, BERT, and RoBERTa on multiple datasets. It also offers an accessible path to scalable model compression based on BN statistics.
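As a rough illustration of how BN parameters could yield an ON/OFF state, the sketch below assumes that the normalized input to each BN unit is approximately standard normal, so the BN output is distributed as N(β, γ²) and is followed by a ReLU. The function names, the threshold `eps`, the extra "MIXED" label, and the ranking by fraction of undecided neurons are all illustrative assumptions; they are not the paper's exact criterion, which ranks layers by their impact on performance.

```python
import math

def on_probability(beta: float, gamma: float) -> float:
    """P(z > 0) for z ~ N(beta, gamma^2): chance that a ReLU fed by this BN output is active."""
    if gamma == 0.0:
        # Degenerate case: the BN output is the constant beta.
        return 1.0 if beta > 0.0 else 0.0
    return 0.5 * (1.0 + math.erf(beta / (abs(gamma) * math.sqrt(2.0))))

def neuron_state(beta: float, gamma: float, eps: float = 0.01) -> str:
    """Label a neuron ON or OFF when it is almost always in that state (eps is an assumed threshold)."""
    p = on_probability(beta, gamma)
    if p >= 1.0 - eps:
        return "ON"      # ReLU behaves like the identity on this neuron
    if p <= eps:
        return "OFF"     # ReLU outputs (almost) always zero
    return "MIXED"       # genuinely non-linear neuron

def rank_layers(bn_params: dict, eps: float = 0.01) -> list:
    """Sort layer names by the fraction of MIXED neurons, fewest first (hypothetical proxy)."""
    def mixed_fraction(pairs):
        states = [neuron_state(b, g, eps) for b, g in pairs]
        return states.count("MIXED") / len(states)
    return sorted(bn_params, key=lambda name: mixed_fraction(bn_params[name]))

# Toy (beta, gamma) pairs per layer; the values are made up for illustration.
toy = {
    "layer1": [(0.1, 1.0), (-0.2, 1.0), (4.0, 1.0)],
    "layer2": [(5.0, 1.0), (-6.0, 1.0), (3.5, 0.5)],
}
ranking = rank_layers(toy)
```

Under these assumptions, a layer whose neurons are nearly all ON or OFF behaves almost linearly, which is what would make it a plausible candidate for removal.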

Abstract

Today, deep neural networks are widely used since they can handle a variety of complex tasks. Their generality makes them very powerful tools in modern technology. However, deep neural networks are often overparameterized. The usage of these large models consumes a lot of computation resources. In this paper, we introduce a method called Till the Layers Collapse (TLC), which compresses deep neural networks through the lenses of batch normalization layers. By reducing the depth of these networks, our method decreases deep neural networks' computational requirements and overall latency. We validate our method on popular models such as Swin-T, MobileNet-V2, and RoBERTa, across both image classification and natural language processing (NLP) tasks.

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the key steps of TLC: identification of the layer to remove, removal of irrelevant channels, linearization of the remaining channels, and removal of the layer.
  • Figure 2: Error plot for the i-th neuron of the l-th layer as a function of the batch norm mean parameter $\beta_{l,i}$ for a standard deviation $\gamma_{l,i}=1$.
  • Figure 3: Validation loss of the complete ResNet-18 model pre-trained on CIFAR-10 and of the same model with one layer removed.
  • Figure 4: Test performance (top-1) for models trained on CIFAR-10 with different numbers of layers removed by TLC.
  • Figure 5: Kullback-Leibler (KL) divergence between the output features of the original VGG-16bn model trained on CIFAR-10 and those of models with layers removed by different methods.
  • ...and 1 more figure