Pick-or-Mix: Dynamic Channel Sampling for ConvNets
Ashish Kumar, Daneul Kim, Jaesik Park, Laxmidhar Behera
TL;DR
This paper introduces Pick-or-Mix (PiX), a dynamic, per-pixel channel sampling module that reduces the computational cost of $1\times1$ channel squeezing in ConvNets without requiring a specialized implementation. PiX computes a compact global context per channel and derives from it a sampling probability for each output channel. That probability determines whether a subset of input channels is fused with the max operator or the average operator at each pixel. The authors show that PiX can squeeze channels, downscale networks, and act as a dynamic pruner, achieving substantial FLOP reductions (e.g., about 23% in ResNet squeezing) and practical latency gains (up to ~32% on various GPUs) while maintaining or improving accuracy across ResNet, VGG, MobileNet, and ViT backbones. In the reported experiments, PiX outperforms attention-based modules (SE/CBAM) and existing dynamic pruning methods, and it also generalizes to vision transformers, which makes it a lightweight and widely applicable module for efficient deep models.
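The mechanism described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the function name `pix_sketch`, the use of global average pooling as the context, a single linear map (`weight`, `bias`) followed by a sigmoid to obtain the probabilities, contiguous channel subsets of size `zeta`, and the 0.5 threshold that selects max ("pick") versus average ("mix"). Whether the probability also rescales the fused output is not stated here, so the sketch leaves the output unscaled.

```python
import math


def pix_sketch(x, weight, bias, zeta, threshold=0.5):
    """Illustrative pick-or-mix channel squeezing (assumptions noted above).

    x:      list of C channels, each an H x W nested list of floats.
    weight: (C // zeta) rows of C floats, a hypothetical learned linear map.
    bias:   (C // zeta) floats, hypothetical learned biases.
    Returns (out, probs), where out has C // zeta channels of size H x W.
    """
    channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    assert channels % zeta == 0, "C must be divisible by the subset size"

    # 1. Compact global context: one average value per input channel (assumed).
    context = [sum(sum(row) for row in ch) / (height * width) for ch in x]

    # 2. One sampling probability per output channel: sigmoid(linear(context)).
    probs = []
    for w_row, b in zip(weight, bias):
        z = sum(wi * ci for wi, ci in zip(w_row, context)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))

    # 3. Fuse each subset of zeta channels at every pixel:
    #    max ("pick") if p >= threshold, otherwise average ("mix").
    out = []
    for k, p in enumerate(probs):
        subset = x[k * zeta:(k + 1) * zeta]
        if p >= threshold:
            fuse = max
        else:
            fuse = lambda vals: sum(vals) / len(vals)
        out.append([[fuse([ch[i][j] for ch in subset]) for j in range(width)]
                    for i in range(height)])
    return out, probs


# Usage: 4 input channels of size 1 x 2, squeezed to 2 output channels.
x = [[[1.0, 4.0]], [[3.0, 2.0]], [[2.0, 6.0]], [[4.0, 0.0]]]
weight = [[0.0] * 4, [0.0] * 4]   # zero weights so the biases fix the choice
bias = [10.0, -10.0]              # first subset -> max, second -> average
out, probs = pix_sketch(x, weight, bias, zeta=2)
print(out)  # [[[3.0, 4.0]], [[3.0, 3.0]]]
```

Because the fusion step uses only max and average, this sketch needs no sparse kernels or channel masking, which is in line with the claim that no specialized implementation is required.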
Abstract
Channel pruning approaches for convolutional neural networks (ConvNets) deactivate channels, either statically or dynamically, and require special implementation. In addition, channel squeezing in representative ConvNets is carried out via 1x1 convolutions, which account for a large portion of the computation and network parameters. Given these challenges, we propose an effective multi-purpose module for dynamic channel sampling, namely Pick-or-Mix (PiX), which does not require special implementation. PiX divides a set of channels into subsets and then picks from them, where the picking decision is made dynamically for each pixel based on the input activations. We plug PiX into prominent ConvNet architectures and verify its multi-purpose utility. After replacing the 1x1 channel squeezing layers in ResNet with PiX, the network becomes 25% faster without losing accuracy. We show that PiX allows ConvNets to learn better data representations than widely adopted approaches for enhancing a network's representation power (e.g., SE, CBAM, AFF, SKNet, and DWP). We also show that PiX achieves state-of-the-art performance on network downscaling and dynamic channel pruning applications.
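To give a rough sense of why 1x1 squeezing layers are costly, the sketch below counts multiply-accumulates (MACs) for a 1x1 convolution and compares them with an approximate operation count for a pooling-plus-fusion module of the kind described in the abstract. The operation model for the PiX-like module (one global average, one linear map, one compare-or-add per input element) and the layer shape are assumptions chosen for illustration. The result is a per-layer estimate, not a measurement, and it is not comparable to the whole-network speedups reported in the paper.

```python
def conv1x1_macs(height, width, c_in, c_out):
    """MACs of a 1x1 convolution: every output pixel and channel sums over c_in."""
    return height * width * c_in * c_out


def pix_ops_estimate(height, width, c_in, zeta):
    """Approximate operation count for a PiX-like module (assumed cost model)."""
    c_out = c_in // zeta
    pooling = height * width * c_in   # additions for the global average
    linear = c_in * c_out             # MACs mapping context to probabilities
    fusion = height * width * c_in    # about one compare or add per input element
    return pooling + linear + fusion


# Illustrative shape: a 56 x 56 feature map squeezed from 256 to 64 channels.
conv_cost = conv1x1_macs(56, 56, 256, 64)
pix_cost = pix_ops_estimate(56, 56, 256, zeta=4)
print(conv_cost)  # 51380224
print(pix_cost)   # 1622016
```

Under this cost model the convolution grows with the product of input and output channels, while the fusion step grows only with the number of input channels.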
