Rethinking the Role of Spatial Mixing

George Cazenavette, Joel Julin, Simon Lucey

TL;DR

The paper investigates the distinct roles of spatial and channel mixing in isotropic vision models by freezing spatial (depthwise) filters and training only the channel (pointwise) filters. Across ResNet and ConvMixer architectures, channels-only models achieve performance close to fully learned networks, while spatial-only models perform worse, highlighting the primacy of channel-mixing learning. Moreover, random spatial mixing imparts natural robustness to adversarial attacks, which can be amplified by smoothing the random filters, and the approach extends beyond classification to tasks like pixel un-shuffling. These findings suggest architectural efficiency: complex learning can focus on channel mixing, with random spatial mixers providing desirable robustness and spectral properties, informing future design of robust, efficient vision models.

Abstract

Until quite recently, the backbone of nearly every state-of-the-art computer vision model has been the 2D convolution. At its core, a 2D convolution simultaneously mixes information across both the spatial and channel dimensions of a representation. Many recent computer vision architectures consist of sequences of isotropic blocks that disentangle the spatial and channel-mixing components. This separation of the operations allows us to more closely juxtapose the effects of spatial and channel mixing in deep learning. In this paper, we take an initial step towards garnering a deeper understanding of the roles of these mixing operations. Through our experiments and analysis, we discover that on both classical (ResNet) and cutting-edge (ConvMixer) models, we can reach nearly the same level of classification performance by learning only the channel mixers and leaving the spatial mixers at their random initializations. Furthermore, we show that models with random, fixed spatial mixing are naturally more robust to adversarial perturbations. Lastly, we show that this phenomenon extends past the classification regime, as such models can also decode pixel-shuffled images.
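
The channels-only setup described above can be illustrated with a minimal sketch. It is not the authors' code: the class name `ChannelsOnlyBlock`, the Gaussian initialization scales, the squared-error loss, and the plain gradient step are all assumptions chosen to keep the example small. The block applies a depthwise (spatial) filter per channel, then a 1x1 (channel) mix, and the update rule only ever touches the channel weights.

```python
import random


def depthwise_conv(x, filters):
    """Spatial mixing: each channel is filtered by its own k x k kernel.

    x is a nested list [C][H][W]; filters is [C][k][k]. Zero padding keeps
    the output at H x W. This is cross-correlation, as in most deep
    learning libraries.
    """
    k = len(filters[0])
    r = k // 2
    H, W = len(x[0]), len(x[0][0])
    out = []
    for chan, filt in zip(x, filters):
        plane = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += filt[di][dj] * chan[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out


def pointwise_conv(z, weights):
    """Channel mixing: a 1x1 convolution, the same linear map at every pixel."""
    H, W = len(z[0]), len(z[0][0])
    return [[[sum(w * z[c][i][j] for c, w in enumerate(row))
              for j in range(W)] for i in range(H)] for row in weights]


class ChannelsOnlyBlock:
    """Hypothetical block: frozen random spatial filters, learned channel mix."""

    def __init__(self, channels, kernel_size=3, seed=0):
        rng = random.Random(seed)
        # Spatial filters are drawn once and never updated.
        # The scale 1 / kernel_size is an assumption, not taken from the paper.
        self.spatial = [[[rng.gauss(0.0, 1.0 / kernel_size)
                          for _ in range(kernel_size)]
                         for _ in range(kernel_size)]
                        for _ in range(channels)]
        # Channel-mixing weights are the only parameters sgd_step modifies.
        self.channel = [[rng.gauss(0.0, channels ** -0.5)
                         for _ in range(channels)]
                        for _ in range(channels)]

    def forward(self, x):
        return pointwise_conv(depthwise_conv(x, self.spatial), self.channel)

    def sgd_step(self, x, target, lr=1e-3):
        """One gradient step on 0.5 * ||forward(x) - target||^2, channel weights only."""
        z = depthwise_conv(x, self.spatial)
        y = pointwise_conv(z, self.channel)
        H, W = len(z[0]), len(z[0][0])
        loss = 0.0
        grad = [[0.0] * len(z) for _ in self.channel]
        for o in range(len(y)):
            for i in range(H):
                for j in range(W):
                    err = y[o][i][j] - target[o][i][j]
                    loss += 0.5 * err * err
                    for c in range(len(z)):
                        grad[o][c] += err * z[c][i][j]
        for o, row in enumerate(grad):
            for c, g in enumerate(row):
                self.channel[o][c] -= lr * g
        return loss  # loss measured before the update


# Toy usage on random data: 4 channels, 6 x 6 feature maps.
rng = random.Random(1)
x = [[[rng.gauss(0, 1) for _ in range(6)] for _ in range(6)] for _ in range(4)]
target = [[[rng.gauss(0, 1) for _ in range(6)] for _ in range(6)] for _ in range(4)]
block = ChannelsOnlyBlock(channels=4)
losses = [block.sgd_step(x, target) for _ in range(20)]
```

In a framework such as PyTorch, the same idea would usually be expressed by disabling gradients on the depthwise layers rather than writing the update by hand; the explicit loop here only makes it visible that the spatial filters never change.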

Paper Structure

This paper contains 28 sections, 5 equations, 16 figures.

Figures (16)

  • Figure 1: The spectral envelope of a bank of random filters rapidly grows as we add more uncorrelated filters to the set, allowing for quality feature extraction from natural images.
  • Figure 2: While the fully learned ResNets (Full) outperform all others, we see that the models that only learn channel mixing (Chans) remain quite competitive, especially so on ImageNet (right). Conversely, the models that only learn spatial mixing (Space) lag very far behind the others.
  • Figure 3: While randomly initialized filters can provide competitive results, the same is not true for any arbitrary, fixed filter. Random filters work best when they are uncorrelated with each other, allowing them to extract different information.
  • Figure 4: As we increase the network width, we see the performance of the channels-only models converges to that of the fully-learned models.
  • Figure 5: With the ConvMixer architecture, we can better analyze the direct contributions of spatial and channel mixing without altering the original model. We again see networks that only learn channel mixing (Chans) remain competitive with their fully-learned counterparts (Full) while completely out-classing those that only learn spatial mixing (Space).
  • ...and 11 more figures
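
The idea in the Figure 1 caption can be sketched numerically. The excerpt does not define "spectral envelope", so the code below assumes one plausible reading: the per-frequency maximum of the filters' DFT magnitudes. The function names, the 8 x 8 frequency grid, and the unit-variance Gaussian filters are likewise illustrative assumptions.

```python
import cmath
import random


def magnitude_response(filt, n=8):
    """|DFT| of a k x k filter zero-padded to an n x n frequency grid."""
    k = len(filt)
    resp = []
    for u in range(n):
        row = []
        for v in range(n):
            s = 0j
            for a in range(k):
                for b in range(k):
                    phase = -2j * cmath.pi * (u * a + v * b) / n
                    s += filt[a][b] * cmath.exp(phase)
            row.append(abs(s))
        resp.append(row)
    return resp


def spectral_envelope(bank, n=8):
    """Assumed definition: per-frequency max of magnitude responses over the bank."""
    responses = [magnitude_response(f, n) for f in bank]
    return [[max(r[u][v] for r in responses) for v in range(n)]
            for u in range(n)]


def envelope_floor(bank, n=8):
    """Smallest envelope value: the frequency the bank responds to most weakly."""
    return min(min(row) for row in spectral_envelope(bank, n))


# Grow a bank of independent random 3 x 3 filters and track the envelope floor.
rng = random.Random(0)
bank = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(3)]
        for _ in range(16)]
floors = [envelope_floor(bank[:m]) for m in (1, 2, 4, 8, 16)]
```

Because each larger bank contains the smaller one, `floors` can never decrease; how quickly it rises with independently drawn filters is the effect the caption describes.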