Spectral-Adaptive Modulation Networks for Visual Perception
Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim
TL;DR
This work theoretically analyzes the spectral properties of 2D convolution and self-attention via graph signal processing, showing that node connectivity, controlled by window size, drives their frequency responses and that large kernels tend toward low-pass behavior similar to self-attention. Building on this, it introduces SPAM, a spectral-adaptive token mixer using multi-kernel convolutions and a spectral re-scaling filter implemented in the FFT domain, and integrates it into SPANetV2, a four-stage MetaFormer backbone. SPANetV2 achieves state-of-the-art or competitive results across ImageNet-1K classification, COCO object detection/instance segmentation, and ADE20K semantic segmentation, while providing a principled framework for spectral adaptation and texture-shape bias management. The work suggests that spectral-adaptive modulation can generalize across perception tasks, with potential extensions to efficiency-focused designs and other modalities.
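The link between window size and low-pass behavior can be illustrated with a toy example. The sketch below is not the paper's graph spectral analysis: it assumes a 1D ring graph and a uniform averaging filter (both simplifications chosen for illustration), and the helper names `mean_filter_response` and `high_freq_energy` are hypothetical. A window that spans every node stands in, loosely, for attention with uniform weights.

```python
import numpy as np

def mean_filter_response(n_nodes: int, window: int) -> np.ndarray:
    """Magnitude frequency response of uniform averaging over `window`
    neighbouring nodes on a ring graph with `n_nodes` nodes (toy assumption)."""
    kernel = np.zeros(n_nodes)
    half = window // 2
    # Centre the window on node 0; the modulo wraps around the ring.
    idx = np.arange(-half, window - half) % n_nodes
    kernel[idx] = 1.0 / window
    return np.abs(np.fft.rfft(kernel))

def high_freq_energy(response: np.ndarray) -> float:
    """Fraction of squared response located in the upper half of the spectrum."""
    cut = len(response) // 2
    return float(np.sum(response[cut:] ** 2) / np.sum(response ** 2))

N = 64
for k in (3, 7, 15, N):
    frac = high_freq_energy(mean_filter_response(N, k))
    print(f"window={k:2d}  high-frequency energy fraction={frac:.4f}")
```

In this toy setting the high-frequency fraction shrinks as the window grows, and the full-ring window passes only the DC component, which is consistent with the qualitative claim that denser connectivity pushes a mixer toward low-pass filtering.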
Abstract
Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective at high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
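As a rough picture of what re-scaling spectral components in the FFT domain can look like, the sketch below applies separate gains to low and high spatial frequencies of a feature map. It is a minimal illustration, not the SPAM filter: the function name `spectral_rescale`, the fixed radial `cutoff`, and the two scalar gains are all assumptions made here, and how the actual mechanism parameterizes its re-scaling is not specified in this text.

```python
import numpy as np

def spectral_rescale(x: np.ndarray, low_gain: float, high_gain: float,
                     cutoff: float = 0.25) -> np.ndarray:
    """Re-scale the spectrum of a real (C, H, W) feature map.

    Frequencies whose radial normalised frequency is below `cutoff`
    (cycles per sample) are multiplied by `low_gain`, the rest by
    `high_gain`. All three parameters are hypothetical knobs.
    """
    _, h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]    # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.rfftfreq(w)[None, :]   # horizontal frequencies in [0, 0.5]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    gain = np.where(radius < cutoff, low_gain, high_gain)

    spec = np.fft.rfft2(x, axes=(-2, -1))          # (C, H, W // 2 + 1)
    return np.fft.irfft2(spec * gain, s=(h, w), axes=(-2, -1))

# Usage: damp high frequencies by half on a random 4-channel feature map.
features = np.random.default_rng(0).standard_normal((4, 16, 16))
smoothed = spectral_rescale(features, low_gain=1.0, high_gain=0.5)
print(smoothed.shape)
```

Setting both gains to 1 leaves the input unchanged, while `high_gain=0` reduces the operation to an ideal low-pass filter; a learned mixer would presumably sit somewhere between these extremes.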
