CNN Mixture-of-Depths

Rinor Cakaj; Jens Mehnert; Bin Yang

TL;DR

CNN Mixture-of-Depths (MoD) reduces the computational cost of CNNs by dynamically selecting the most informative channels within each Conv-Block and processing only those, while a fusion mechanism keeps the tensor shape fixed. The result is a static computation graph with dynamic resource allocation. The method combines three parts: a Channel Selector, Conv-Blocks that operate on a reduced number of channels, and a fusion step that restores the original dimensionality. It yields substantial speedups with little to no loss in accuracy on ImageNet, Cityscapes, and Pascal VOC, with improvements also observed on CIFAR. Key contributions include demonstrating that a fixed-graph MoD approach achieves practical speedups without custom CUDA kernels or specialized losses, and showing that channel-wise selective processing provides both efficiency and regularization benefits. The results suggest strong potential for deploying efficient CNNs on resource-constrained devices while keeping competitive performance on vision tasks. Future work focuses on kernel-level optimization of the fusion path and on finding the optimal channel count per block.

Abstract

We introduce Mixture-of-Depths (MoD) for Convolutional Neural Networks (CNNs), a novel approach that enhances the computational efficiency of CNNs by selectively processing channels based on their relevance to the current prediction. This method optimizes computational resources by dynamically selecting key channels in feature maps for focused processing within the convolutional blocks (Conv-Blocks), while skipping less relevant channels. Unlike conditional computation methods that require dynamic computation graphs, CNN MoD uses a static computation graph with fixed tensor sizes, which improves hardware efficiency. It speeds up both training and inference without the need for customized CUDA kernels, unique loss functions, or fine-tuning. CNN MoD either matches the performance of traditional CNNs with reduced inference times, GMACs, and parameters, or exceeds their performance while maintaining similar inference times, GMACs, and parameters. For example, on ImageNet, ResNet86-MoD exceeds the performance of the standard ResNet50 by 0.45% with a 6% speedup on CPU and 5% on GPU. Moreover, ResNet75-MoD achieves the same performance as ResNet50 with a 25% speedup on CPU and 15% on GPU.
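
To give a feel for why processing only a subset of channels lowers GMACs, the following back-of-the-envelope sketch counts multiply-accumulates for a single dense convolution. The layer sizes (`C = 256`, `k = 64`, a 3x3 kernel on a 14x14 map) and the helper `conv_macs` are hypothetical and chosen only for illustration; they are not taken from the paper, and the small cost of the Channel Selector is ignored.

```python
def conv_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulates of one dense 2D convolution (no bias, groups=1)."""
    return c_in * c_out * kernel * kernel * h_out * w_out

# Hypothetical sizes, for illustration only (not from the paper).
C, k = 256, 64

full = conv_macs(C, C, 3, 14, 14)      # convolution over all C channels
reduced = conv_macs(k, k, 3, 14, 14)   # same convolution over k selected channels

print(full // reduced)  # 16, i.e. (C / k) ** 2
```

Because the cost of a dense convolution scales with the product of input and output channels, shrinking both from `C` to `k` reduces the count quadratically under these assumptions. Real Conv-Blocks contain several layers, so the saving for a full model will differ.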

Paper Structure (34 sections, 1 equation, 4 figures, 11 tables)

Introduction
Related Work
Static Pruning
Dynamic Computing
CNN MoD: Combining Static and Dynamic Advantages
Method: CNN Mixture-of-Depths
Channel Selector
Channel Processing Dynamics
Fusion Mechanism
Integration in CNN Architecture
Experiments
Image Recognition on ImageNet
Semantic Segmentation on Cityscapes
Object Detection on Pascal VOC
Channel Selection and Regularization Analysis
...and 19 more sections

Figures (4)

Figure 1: Illustration of the CNN MoD mechanism, which starts with the Channel Selector module. This module computes the importance scores of each channel in the input feature map, $X \in \mathbb{R}^{C \times H \times W}$, and selects the top-$k$ channels for focused processing in the Conv-Block. These selected channels are then processed by a Conv-Block designed to operate on a reduced dimension, $\hat{X} \in \mathbb{R}^{k \times H \times W}$, enhancing computational efficiency. The processed channels are added to the first $k$ channels of the original feature map through a fusion operation, instead of being added back to their original positions. The resulting feature map is denoted by $\bar{X}$. This selective reintegration of refined channels with the unprocessed channels preserves the dimensions of the original feature map ($X \in \mathbb{R}^{C \times H \times W}$).
Figure 2: Illustration of the Channel Selection Process in MoD. The process begins with an input tensor $X \in \mathbb{R}^{C \times H \times W}$, which undergoes adaptive average pooling to reduce spatial dimensions to $1 \times 1$, preserving channel information. The pooled output is processed through a two-layer fully connected network with a reduction factor ($r=16$), followed by a sigmoid activation to generate channel-wise scores. These scores are used to select the top-$k$ channels. This forms a subset of the original tensor with reduced channel dimension but original spatial dimensions.
Figure 3: ResNet MoD models outperform standard ResNets under similar computational constraints, as shown across four panels. Panel (a) shows the higher accuracy per GMAC, highlighting better computational efficiency, while panel (b) illustrates the improved parameter efficiency. Panels (c) and (d) demonstrate ResNet MoD's superior top-1 accuracy with comparable or faster inference times on CPU and GPU.
Figure 4: Channel selection frequencies within the third module of a ResNet75-MoD for five diverse ImageNet classes: plane, truck, church, cliff, and pug. The analysis shows the percentage of times channels are selected by the Channel Selector out of the total selections in the layer, based on fifty samples per class. These percentages are compared to a baseline derived from all 1000 classes, indicating that the Channel Selector selects different channels for different classes.
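
The selection and fusion steps described in the captions of Figures 1 and 2 can be sketched for a single sample as follows. This is a minimal NumPy illustration, not the authors' implementation. The function names, the random weights, the ReLU between the two fully connected layers, and the 1x1 channel-mixing stand-in for the Conv-Block are all assumptions made for the example. Only the overall flow (average pooling, two-layer network with reduction factor $r=16$, sigmoid, top-$k$, addition to the first $k$ channels) follows the captions.

```python
import numpy as np

def channel_scores(x, w1, w2):
    """Score each channel of x (shape (C, H, W)); returns (C,) values in (0, 1)."""
    pooled = x.mean(axis=(1, 2))            # average pooling to 1x1 -> (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)   # FC 1: C -> C // r (ReLU is an assumption)
    logits = w2 @ hidden                    # FC 2: C // r -> C
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid

def select_top_k(x, scores, k):
    """Return the k highest-scoring channels of x and their indices."""
    idx = np.argsort(scores)[::-1][:k]
    return x[idx], idx

def mod_block(x, w1, w2, k, conv_block):
    """Select k channels, process them, fuse into the first k channels of x."""
    scores = channel_scores(x, w1, w2)
    x_hat, idx = select_top_k(x, scores, k)  # (k, H, W)
    processed = conv_block(x_hat)            # must return shape (k, H, W)
    x_bar = x.copy()
    x_bar[:k] += processed                   # fusion: first k channels, not idx
    return x_bar, idx

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r, k = 64, 8, 8, 16, 16
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
mix = rng.standard_normal((k, k)) * 0.1

def conv_block(t):
    # Stand-in for the reduced Conv-Block: a 1x1 channel-mixing convolution.
    return np.einsum("oc,chw->ohw", mix, t)

x_bar, idx = mod_block(x, w1, w2, k, conv_block)
print(x_bar.shape)  # (64, 8, 8): same shape as the input
```

The output keeps the input shape regardless of which channels are selected, which is what allows a static computation graph. Channels `k` and above pass through unchanged in this sketch.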
