Rethinking the Role of Spatial Mixing
George Cazenavette, Joel Julin, Simon Lucey
TL;DR
The paper investigates the distinct roles of spatial and channel mixing in isotropic vision models by freezing the spatial (depthwise) filters at their random initializations and training only the channel (pointwise) filters. Across ResNet and ConvMixer architectures, channels-only models achieve performance close to fully learned networks, while spatial-only models perform worse, highlighting the primacy of learned channel mixing. Moreover, random spatial mixing imparts natural robustness to adversarial attacks, which can be amplified by smoothing the random filters, and the approach extends beyond classification to tasks such as decoding pixel-shuffled images. These findings suggest that learning can be concentrated in the channel mixers, with random spatial mixers providing desirable robustness and spectral properties, which can inform the design of robust, efficient vision models.
Abstract
Until quite recently, the backbone of nearly every state-of-the-art computer vision model has been the 2D convolution. At its core, a 2D convolution simultaneously mixes information across both the spatial and channel dimensions of a representation. Many recent computer vision architectures consist of sequences of isotropic blocks that disentangle the spatial and channel-mixing components. This separation of the operations allows us to more closely juxtapose the effects of spatial and channel mixing in deep learning. In this paper, we take an initial step towards garnering a deeper understanding of the roles of these mixing operations. Through our experiments and analysis, we discover that on both classical (ResNet) and cutting-edge (ConvMixer) models, we can reach nearly the same level of classification performance by learning only the channel mixers and leaving the spatial mixers at their random initializations. Furthermore, we show that models with random, fixed spatial mixing are naturally more robust to adversarial perturbations. Lastly, we show that this phenomenon extends past the classification regime, as such models can also decode pixel-shuffled images.
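To make the separation of the two mixing operations concrete, the following is a minimal NumPy sketch of one isotropic block, not the authors' implementation. The function names, the ReLU nonlinearity, the absence of normalization, and the sizes (8 channels, 3x3 filters, 16x16 input) are all illustrative assumptions. The depthwise filters play the role of the random, fixed spatial mixer, while the pointwise matrix is the only part that would be trained in a channels-only model.

```python
import numpy as np

def depthwise_conv2d(x, filters):
    """Spatial mixing: each channel is filtered by its own k x k kernel.

    x: (C, H, W), filters: (C, k, k) with odd k. Uses zero 'same' padding and
    cross-correlation, the convention of most deep learning frameworks.
    """
    _, h, w = x.shape
    k = filters.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Shifted copy of the input, weighted per channel by one filter tap.
            out += filters[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out

def pointwise_conv2d(x, weight):
    """Channel mixing: a 1x1 convolution, i.e. the same matrix applied at every pixel.

    x: (C_in, H, W), weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, x)

def mixer_block(x, spatial_filters, channel_weight):
    """Simplified block: residual spatial mixing, then channel mixing (assumed ReLU, no norm)."""
    h = x + np.maximum(depthwise_conv2d(x, spatial_filters), 0.0)
    return np.maximum(pointwise_conv2d(h, channel_weight), 0.0)

rng = np.random.default_rng(0)
C, K, H, W = 8, 3, 16, 16  # illustrative sizes, not taken from the paper

# Random spatial mixer: drawn once and never updated (frozen).
spatial_filters = rng.standard_normal((C, K, K))
# Channel mixer: the only parameters an optimizer would update in a channels-only model.
channel_weight = rng.standard_normal((C, C)) / np.sqrt(C)

x = rng.standard_normal((C, H, W))
y = mixer_block(x, spatial_filters, channel_weight)

print("output shape:", y.shape)
print("frozen spatial parameters:", spatial_filters.size)
print("trainable channel parameters:", channel_weight.size)
```

In a framework with automatic differentiation, the same idea would amount to excluding the depthwise weights from the optimizer; the sketch above shows only the forward pass and which parameters belong to which mixer.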
