Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context
Shashank Agnihotri, Julia Grabinski, Margret Keuper
TL;DR
This work addresses spectral artifacts that emerge during upsampling in pixel-wise prediction tasks. Motivated by a theoretical Fourier-domain analysis, it proposes Large Context Transposed Convolutions (LCTC) with $7\times7$ or $11\times11$ kernels (plus a parallel $3\times3$ path) to provide richer spatial context during upsampling. Extensive experiments across image restoration, semantic segmentation, and disparity estimation show that LCTC reduces spectral artifacts and improves robustness under adversarial attacks, whereas merely increasing decoder capacity without adding upsampling context does not help. The findings suggest that incorporating large-context upsampling into encoder-decoder architectures, including those with ViT-based backbones, yields practical stability gains with a reasonable trade-off in clean performance. Limitations are discussed as directions for future work.
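To make the LCTC idea concrete, the following is a minimal PyTorch sketch of an upsampling block with a large-kernel transposed convolution and a parallel $3\times3$ path. It is an illustration under stated assumptions, not the authors' implementation: the class name `LargeContextUpsample`, the summation of the two paths, the stride of 2, and the padding scheme are all assumptions that the summary above does not specify.

```python
import torch
import torch.nn as nn


class LargeContextUpsample(nn.Module):
    """Hypothetical sketch of a large-context transposed-convolution block.

    Assumptions (not taken from the source): the two paths are summed,
    kernels are odd-sized, and the block upsamples by `stride`.
    """

    def __init__(self, in_ch, out_ch, large_kernel=7, small_kernel=3, stride=2):
        super().__init__()

        def tconv(k):
            # For odd k, padding=k//2 and output_padding=stride-1 make the
            # output exactly `stride` times larger than the input.
            return nn.ConvTranspose2d(
                in_ch, out_ch, kernel_size=k, stride=stride,
                padding=k // 2, output_padding=stride - 1,
            )

        self.large = tconv(large_kernel)  # large spatial context (7x7 or 11x11)
        self.small = tconv(small_kernel)  # parallel small-kernel path (3x3)

    def forward(self, x):
        # Assumed combination rule: element-wise sum of both paths.
        return self.large(x) + self.small(x)


up = LargeContextUpsample(in_ch=8, out_ch=4, large_kernel=11)
features = torch.randn(1, 8, 16, 16)
upsampled = up(features)  # spatial size doubles from 16x16 to 32x32
```

Both paths use fully learned weights; the only difference from a standard transposed-convolution upsampler in this sketch is the kernel size of the main path and the added parallel path.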
Abstract
Pixel-wise predictions are required in a wide variety of tasks such as image restoration, image segmentation, or disparity estimation. Common models involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then increased to generate a high-resolution output. Previous works have shown that resampling operations are subject to artifacts such as aliasing. During downsampling, aliases have been shown to compromise the prediction stability of image classifiers. During upsampling, they have been leveraged to detect generated content. Yet, the effect of aliases during upsampling has not been discussed with respect to the stability and robustness of pixel-wise predictions. While falling under the same term (aliasing), the challenges for correct upsampling in neural networks differ significantly from those during downsampling: when downsampling, some high frequencies cannot be correctly represented and have to be removed to avoid aliases. However, when upsampling for pixel-wise predictions, we actually require the model to restore such high frequencies that cannot be encoded in lower resolutions. The application of findings from signal processing is therefore a necessary but not a sufficient condition to achieve the desirable output. In contrast, we find that the availability of large spatial context during upsampling makes it possible to provide stable, high-quality pixel-wise predictions, even when fully learning all filter weights.
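The upsampling-side artifact discussed above can be illustrated with a small one-dimensional example. The sketch below uses only the Python standard library and is a simplified illustration, not the paper's analysis: the helper names, the signal length of 16, and the use of plain zero-insertion upsampling (the first step of a stride-2 transposed convolution, before any filtering) are assumptions chosen for clarity. A low-resolution cosine with a single frequency is upsampled by inserting zeros, and the spectrum of the result contains additional high-frequency copies of the original peak.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform, computed directly."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def zero_insert_upsample(signal, factor=2):
    """Place each sample at every `factor`-th position, zeros in between."""
    out = [0.0] * (len(signal) * factor)
    out[::factor] = signal
    return out


n, k0 = 16, 2  # low-resolution length and frequency index of the cosine
low_res = [math.cos(2 * math.pi * k0 * t / n) for t in range(n)]
high_res = zero_insert_upsample(low_res)

# Frequency bins that carry energy in the upsampled signal.
peaks = [k for k, m in enumerate(dft_magnitudes(high_res)) if m > 1e-6]
print(peaks)  # -> [2, 14, 18, 30]
```

Bins 2 and 30 correspond to the original cosine at the higher resolution, while bins 14 and 18 are spectral replicas introduced by zero insertion. The filter applied after zero insertion has to suppress such replicas while the model still needs to produce genuine high-frequency detail, which is the tension the abstract describes.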
