Spectral-Adaptive Modulation Networks for Visual Perception

Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim

TL;DR

This work theoretically analyzes the spectral properties of 2D convolution and self-attention via graph signal processing. It shows that node connectivity, which is controlled by window size, drives their frequency responses, and that large kernels tend toward low-pass behavior similar to self-attention. Building on this, it introduces SPAM, a spectral-adaptive token mixer that uses multi-kernel convolutions and a spectral re-scaling filter implemented in the FFT domain, and integrates it into SPANetV2, a four-stage MetaFormer backbone. SPANetV2 achieves state-of-the-art or competitive results on ImageNet-1K classification, COCO object detection/instance segmentation, and ADE20K semantic segmentation, while providing a principled framework for spectral adaptation and texture-shape bias management. The work suggests that spectral-adaptive modulation can generalize across perception tasks, with potential extensions to efficiency-focused designs and other modalities.
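To make the idea of a spectral re-scaling filter in the FFT domain concrete, the following is a minimal NumPy sketch, not the paper's actual filter. It assumes a hard radial split into a low band and a high band, each scaled by a per-channel gain; the function name `spectral_rescale` and the parameters `low_gain`, `high_gain`, and `cutoff` are hypothetical and chosen only for illustration.

```python
import numpy as np

def spectral_rescale(x, low_gain, high_gain, cutoff=0.25):
    """Scale the low- and high-frequency bands of each channel separately.

    x         : real feature maps of shape (C, H, W)
    low_gain  : hypothetical per-channel gain for the low band, shape (C,)
    high_gain : hypothetical per-channel gain for the high band, shape (C,)
    cutoff    : hypothetical band boundary in cycles per sample
    """
    _, h, w = x.shape
    spectrum = np.fft.fft2(x, axes=(-2, -1))

    # Radial frequency of every FFT bin (the zero frequency sits at index [0, 0]).
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low_mask = (radius <= cutoff).astype(float)  # shape (H, W)

    # Per-channel gain map: low_gain inside the cutoff radius, high_gain outside.
    gain = (low_gain[:, None, None] * low_mask
            + high_gain[:, None, None] * (1.0 - low_mask))

    # The mask is symmetric in frequency, so the result is real up to round-off.
    return np.fft.ifft2(spectrum * gain, axes=(-2, -1)).real


features = np.random.default_rng(0).normal(size=(4, 16, 16))
smoothed = spectral_rescale(features, np.ones(4), 0.5 * np.ones(4))
print(smoothed.shape)
```

Setting both gains to one leaves the input unchanged, while lowering `high_gain` pushes the mixer toward low-pass behavior. In a trained model the gains would presumably be learned rather than fixed by hand.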

Abstract

Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
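The claim that node connectivity shapes the frequency response can be illustrated with a small NumPy experiment. This is a sketch under stated assumptions, not the paper's simulation: tokens of a $16 \times 16$ patch are treated as graph nodes, each node mixes only the nodes inside a square window with independent random positive weights (rows normalized to sum to one), and the graph Fourier basis is taken from the Laplacian of the 4-connected grid. All function names are hypothetical.

```python
import numpy as np

def grid_fourier_basis(side):
    """Eigenvectors of the 4-connected grid Laplacian, sorted by graph frequency."""
    ys, xs = np.divmod(np.arange(side * side), side)
    manhattan = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    adjacency = (manhattan == 1).astype(float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # ascending eigenvalues
    return eigvals, eigvecs

def window_mixing_matrix(side, window, rng):
    """Row-stochastic mixer: each node mixes nodes in a window x window neighbourhood."""
    ys, xs = np.divmod(np.arange(side * side), side)
    half = window // 2
    connected = ((np.abs(ys[:, None] - ys[None, :]) <= half)
                 & (np.abs(xs[:, None] - xs[None, :]) <= half))
    # Assumption: independent random weights per edge, not a shared kernel.
    weights = rng.uniform(size=connected.shape) * connected
    return weights / weights.sum(axis=1, keepdims=True)

def frequency_gain(mixer, basis):
    """Norm of the mixer output for each unit-norm graph Fourier mode."""
    return np.linalg.norm(mixer @ basis, axis=0)


rng = np.random.default_rng(0)
side = 16
_, basis = grid_fourier_basis(side)
high_band = slice(side * side // 2, None)  # upper half of the graph frequencies

high_band_gain = {}
for window in (3, 7, 2 * side - 1):  # 2 * side - 1 connects every pair of nodes
    gains = frequency_gain(window_mixing_matrix(side, window, rng), basis)
    high_band_gain[window] = gains[high_band].mean()
    print(f"window={window:2d}  mean high-frequency gain={high_band_gain[window]:.3f}")
```

In this toy setting the mean gain over the high-frequency half shrinks as the window widens, and the fully connected case (a crude stand-in for self-attention) suppresses high frequencies the most. That matches the low-pass trend described above, though the exact numbers depend on the random weights and on the choice of basis.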

Paper Structure

This paper contains 43 sections, 22 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Simulation examples of frequency response. (a)-(c) show responses of 2D Euclidean convolutions with increasing kernel sizes, and (d) shows responses of self-attention. All responses are obtained with random weights. The input patch size is set to $16 \times 16$, inspired by ViT [dosovitskiy2020image]. As the convolution kernel size increases, the cut-off frequency shifts closer to one, making it behave more like a low-pass filter, akin to self-attention.
  • Figure 2: Overview of the SPAM mixer. The Head Split layer evenly partitions the input along feature dimensions based on the number of heads. DWConv denotes depthwise convolution, while SRF re-scales the spectral components of DWConv's output. All linear layers preserve input dimensions, except Exp, which doubles the feature dimensions, and Proj, which halves them.
  • Figure 3: Relative log amplitudes of Fourier transformed feature maps on all stages. All models follow the MetaFormer baseline [yu2023metaformer] with configurations identical to Table \ref{table:arch_config}, except for the token mixers and activations.
  • Figure 4: Evaluation of models on texture and shape bias. All models are pretrained on ImageNet-1K classification using the same augmentations as the MetaFormer baseline [yu2023metaformer].
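
The relative log amplitude reported in Figure 3 can be approximated as follows. This is a sketch of one common convention, assumed here rather than taken from the paper: take the 2D FFT of each channel, read the log amplitude along the diagonal from the zero frequency outward, subtract the zero-frequency value, and average over channels. The function name and the `eps` parameter are hypothetical.

```python
import numpy as np

def relative_log_amplitude(feature_map, eps=1e-12):
    """Diagonal log amplitudes relative to the zero-frequency bin.

    feature_map : real array of shape (C, H, W) with H == W
    returns     : array of length H - H // 2, averaged over channels;
                  entry 0 is the zero frequency and therefore equals 0
    """
    _, h, w = feature_map.shape
    assert h == w, "this sketch assumes square feature maps"

    spectrum = np.fft.fft2(feature_map, axes=(-2, -1))
    spectrum = np.fft.fftshift(spectrum, axes=(-2, -1))  # zero frequency at h // 2
    log_amp = np.log(np.abs(spectrum) + eps)

    # Walk along the diagonal from the centre (zero frequency) to the corner.
    idx = np.arange(h // 2, h)
    diagonal = log_amp[:, idx, idx]           # shape (C, len(idx))
    relative = diagonal - diagonal[:, :1]     # subtract the zero-frequency value
    return relative.mean(axis=0)


maps = np.random.default_rng(0).normal(size=(8, 16, 16)) + 1.0
print(np.round(relative_log_amplitude(maps), 2))
```

Under this convention, a curve that falls steeply toward the high-frequency end indicates a mixer that behaves like a low-pass filter, while a flatter curve indicates that high-frequency content is preserved.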