Pick-or-Mix: Dynamic Channel Sampling for ConvNets

Ashish Kumar, Daneul Kim, Jaesik Park, Laxmidhar Behera

TL;DR

This paper introduces Pick-or-Mix (PiX), a dynamic, per-pixel channel sampling module that mitigates the computational dominance of $1\times1$ channel squeezing in ConvNets without requiring specialized implementations. PiX computes a compact global context per channel and derives per-output-channel sampling probabilities to fuse channel subsets with either the max or average operator on a per-pixel basis, enabling dynamic, context-driven channel selection. The authors demonstrate PiX as a versatile module that can squeeze channels, downscale networks, and function as a dynamic pruner, achieving substantial FLOP reductions (e.g., about 23% in ResNet squeezing) and practical latency gains (up to ~32% on various GPUs) while maintaining or improving accuracy across ResNet, VGG, MobileNet, and ViT backbones. In the reported experiments, the approach outperforms attention-based modules (SE, CBAM) and existing dynamic pruning methods, and it also generalizes to vision transformers.

Abstract

Channel pruning approaches for convolutional neural networks (ConvNets) deactivate channels, statically or dynamically, and require special implementation. In addition, channel squeezing in representative ConvNets is carried out via $1\times 1$ convolutions, which account for a large portion of the computation and network parameters. Given these challenges, we propose an effective multi-purpose module for dynamic channel sampling, namely Pick-or-Mix (PiX), which does not require special implementation. PiX divides a set of channels into subsets and then picks from them, where the picking decision is made dynamically for each pixel based on the input activations. We plug PiX into prominent ConvNet architectures and verify its multi-purpose utility. After replacing the $1\times 1$ channel squeezing layers in ResNet with PiX, the network becomes 25% faster without losing accuracy. We show that PiX allows ConvNets to learn better data representations than widely adopted approaches for enhancing a network's representation power (e.g., SE, CBAM, AFF, SKNet, and DWP). We also show that PiX achieves state-of-the-art performance on network downscaling and dynamic channel pruning applications.
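
The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pix_squeeze`, the single fully connected layer (`weight`, `bias`), the sigmoid, and the 0.5 threshold used to choose between picking and mixing are all assumptions made for the example. Only the overall flow (global context, per-output-channel probabilities, max or average over channel subsets) comes from the text.

```python
import numpy as np


def pix_squeeze(x, weight, bias, zeta):
    """Illustrative PiX-like channel squeeze (hypothetical helper).

    x      : input activations, shape (C, H, W)
    weight : assumed FC weights, shape (C // zeta, C)
    bias   : assumed FC bias, shape (C // zeta,)
    zeta   : sampling factor; the output has C // zeta channels
    """
    c, h, w = x.shape
    if c % zeta != 0:
        raise ValueError("C must be divisible by zeta")
    c_out = c // zeta

    # Compact global context: one scalar per input channel.
    context = x.mean(axis=(1, 2))                      # (C,)

    # One sampling probability per output channel (assumed FC + sigmoid).
    logits = weight @ context + bias                   # (C_out,)
    p = 1.0 / (1.0 + np.exp(-logits))                  # (C_out,)

    # Divide the channels into C_out contiguous subsets of size zeta.
    subsets = x.reshape(c_out, zeta, h, w)

    # "Pick": per-pixel max over the subset, so the winning channel
    # may differ from pixel to pixel.
    picked = subsets.max(axis=1)                       # (C_out, H, W)
    # "Mix": per-pixel average over the subset.
    mixed = subsets.mean(axis=1)                       # (C_out, H, W)

    # Assumed decision rule: pick when p >= 0.5, otherwise mix.
    use_pick = (p >= 0.5)[:, None, None]
    return np.where(use_pick, picked, mixed), p


# Example: squeeze 4 channels to 2 with zeta = 2.
x = np.arange(16, dtype=float).reshape(4, 2, 2)
weight = np.zeros((2, 4))
bias = np.array([10.0, -10.0])   # forces "pick" for output 0, "mix" for output 1
out, p = pix_squeeze(x, weight, bias, zeta=2)
```

In this toy call the first output channel is the per-pixel maximum of input channels 0 and 1, and the second is the per-pixel average of channels 2 and 3. Note that no dense $1\times 1$ convolution over the spatial grid is involved: the only learned layer operates on the pooled context vector.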

Paper Structure (47 sections, 19 equations, 7 figures, 10 tables)


Figures (7)

  • Figure 1: Conceptual overview of PiX in the context of channel reduction for ConvNets. Top: Traditional dense $1\times 1$ convolution. Although not all channels are important, dense convolutions process all the channels equally. Bottom: PiX avoids dense convolution and samples the channels dynamically from the input by producing sampling probabilities with far fewer FLOPs. PiX is multipurpose without requiring specialized implementations.
  • Figure 2: The proposed PiX module with its Pick-or-Mix dynamic channel sampling strategy. Each subset of input channels is picked (via max operator) or mixed (via average operator) to constitute the squeezed channels of the output. Interestingly, PiX can fuse channels differently for each pixel (please refer to Sec. \ref{sec:channelfusion}).
  • Figure 3: Embedding the proposed PiX into standard networks for various purposes. (a) Channel Squeezing Mode: we replace the $1\times 1$ channel squeezing layers in ResNet \cite{resnet} with PiX; the remaining $1\times 1$ conv layers in the original ResNet are left untouched, as they are intended for expanding channel dimensions. (b & c) Network Downscaling Mode: we insert PiX modules into ResNet and VGG \cite{vgg}. We make the output channel dimension smaller than the input channel dimension by adjusting the sampling factor $\zeta$ in PiX. In other words, depending on $\zeta$, the input and output channel dimensions of the $1\times 1$ and $3\times 3$ conv layers change accordingly. As a result, as $\zeta$ gets larger, the channel dimension of the original network reduces. (c & d) Dynamic Channel Pruning: these configurations are used for comparing PiX with other dynamic channel pruning approaches.
  • Figure 4: PiX vs. existing modules: SE \cite{senet}, CBAM \cite{cbam}, FBS \cite{fbs}, and group convolution \cite{shufflenetv1,shufflenetv2}.
  • Figure 5: FLOPs and memory usage of PiX compared with SE \cite{senet}, CBAM \cite{cbam}, and FBS \cite{fbs}, per module instance. In the memory plot, SE and PiX have almost the same overhead, with PiX lower than SE by roughly 1,000 bytes ($\sim$ 1,000); the same holds for CBAM and FBS. For this reason, the curves overlap in the memory plot. The actual values are also given in Table \ref{tab:flops_mem_usage}.
  • ...and 2 more figures
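
To make the "far fewer FLOPs" statement in the Figure 1 caption concrete, here is a rough per-module operation count. It is an estimate under stated assumptions, not the paper's own accounting: it assumes an input of $C$ channels at resolution $H\times W$, a squeeze to $C/\zeta$ channels, and a PiX-like module made of one global pooling pass, one fully connected layer from $C$ to $C/\zeta$, and one max-or-average pass over the channel subsets. Multiply-accumulates and element-wise operations are each counted as one unit.

$$
\begin{aligned}
\text{cost}_{1\times 1} &\approx H W \cdot C \cdot \frac{C}{\zeta}, \\
\text{cost}_{\text{PiX-like}} &\approx \underbrace{H W C}_{\text{pooling}} + \underbrace{C \cdot \frac{C}{\zeta}}_{\text{FC}} + \underbrace{H W C}_{\text{fusion}}.
\end{aligned}
$$

With hypothetical values $H = W = 28$, $C = 512$, and $\zeta = 2$, the dense convolution costs about $784 \cdot 512 \cdot 256 = 102{,}760{,}448$ operations, while the PiX-like module costs about $2 \cdot 784 \cdot 512 + 512 \cdot 256 = 933{,}888$, roughly 110 times fewer for that single layer. Network-level savings are much smaller than this per-layer ratio, presumably because the other layers are left unchanged.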