Improving Feature Stability during Upsampling -- Spectral Artifacts and the Importance of Spatial Context

Shashank Agnihotri, Julia Grabinski, Margret Keuper

TL;DR

This work addresses spectral artifacts that emerge during upsampling in pixel-wise prediction tasks. Motivated by a Fourier-domain analysis, it proposes Large Context Transposed Convolutions (LCTC) with $7\times7$ or $11\times11$ kernels (plus a parallel $3\times3$ path) to provide richer spatial context during upsampling. Experiments across image restoration, semantic segmentation, and disparity estimation show that LCTC reduces spectral artifacts and improves robustness under adversarial attacks, whereas increasing decoder capacity without enlarging the upsampling context does not help. The findings suggest that incorporating large-context upsampling into encoder-decoder architectures yields practical stability gains for modern vision models, including those with ViT-based backbones, at a reasonable trade-off in clean performance. Limitations are discussed as directions for future work.
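
The structure described above can be sketched in one dimension. The snippet below is a minimal illustration, not the authors' implementation: it assumes that the large-kernel path and the parallel small-kernel path are combined by summation, and it uses the common output-size convention for strided transposed convolutions (`padding = (k - 1) // 2`, `output_padding = 1` for stride 2). The function names `transposed_conv1d` and `lctc_upsample_1d` are hypothetical.

```python
def transposed_conv1d(x, kernel, stride=2, padding=0, output_padding=0):
    """Plain 1D transposed convolution on Python lists.

    Output length: (len(x) - 1) * stride - 2 * padding + len(kernel) + output_padding.
    """
    k = len(kernel)
    full_len = (len(x) - 1) * stride + k
    # Scatter each input sample, weighted by the kernel, into the output grid.
    full = [0.0] * (full_len + output_padding)
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            full[i * stride + j] += xi * kj
    # Crop `padding` samples on both sides; output_padding extends the right side.
    return full[padding:full_len - padding + output_padding]


def lctc_upsample_1d(x, large_kernel, small_kernel):
    """x2 upsampling with a large-kernel path plus a parallel small-kernel path.

    Assumption: the two paths are summed. Both kernels must have odd length so
    that 'same'-style padding keeps the two outputs aligned sample by sample.
    """
    paths = []
    for kernel in (large_kernel, small_kernel):
        pad = (len(kernel) - 1) // 2
        paths.append(transposed_conv1d(x, kernel, stride=2,
                                       padding=pad, output_padding=1))
    return [a + b for a, b in zip(*paths)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    # Hand-picked weights for illustration; in a network they would be learned.
    large = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # 7 taps
    small = [0.5, 0.0, 0.5]                      # 3 taps
    print(lctc_upsample_1d(signal, large, small))
```

With the padding chosen this way, 3-, 7-, and 11-tap kernels all map an input of length `n` to an output of length `2n`, and input sample `i` always lands on output position `2i`, which is what allows the paths to be added element-wise.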

Abstract

Pixel-wise predictions are required in a wide variety of tasks such as image restoration, image segmentation, or disparity estimation. Common models involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then increased to generate a high-resolution output. Previous works have shown that resampling operations are subject to artifacts such as aliasing. During downsampling, aliases have been shown to compromise the prediction stability of image classifiers. During upsampling, they have been leveraged to detect generated content. However, the effect of aliases during upsampling has not yet been discussed with respect to the stability and robustness of pixel-wise predictions. While falling under the same term (aliasing), the challenges for correct upsampling in neural networks differ significantly from those during downsampling: when downsampling, some high frequencies cannot be correctly represented and have to be removed to avoid aliases. When upsampling for pixel-wise predictions, however, we actually require the model to restore such high frequencies that cannot be encoded at lower resolutions. The application of findings from signal processing is therefore a necessary but not a sufficient condition to achieve the desired output. In contrast, we find that the availability of large spatial context during upsampling allows for stable, high-quality pixel-wise predictions, even when all filter weights are fully learned.

Paper Structure (48 sections, 6 equations, 14 figures, 17 tables)

This paper contains 48 sections, 6 equations, 14 figures, 17 tables.

Figures (14)

  • Figure 1: Image restoration example using NAFNet [chen2022simple] variants on GoPro [gopro]. Most prior work upsamples with Pixel Shuffle [pixelshuffle] (first row) or with transposed convolutions [dumoulin2016guide] using small learnable filters (2$\times$2 or 3$\times$3) (second row). Both lead to spectral artifacts for which the model needs to compensate. The clean (in-domain) restored images look appealing, but an adversary (here a 5-step PGD [pgd] attack) can exploit aliases so that the artifacts become easily visible. In the frequency domain, they manifest as repeating peaks across the spectrum. Based on sampling-theoretic considerations, we propose Large Context Transposed Convolutions (7$\times$7 or larger) (bottom row). They significantly increase the model's stability during upsampling, as seen in the restored image under attack and in its frequency spectrum.
  • Figure 2: (Left) Linear interpolation (pink) of the samples (green) causes aliases. (Right) Optimal signal reconstruction (pink) is achieved by $\mathrm{sinc}$ interpolation. In practice our spatial context is limited and the interpolation function is discrete. Yet, increasing the kernel size enables the approximation of larger $\mathrm{sinc}$-like structures.
  • Figure 3: An image from GoPro [gopro] downsampled with 3$\times$3 MaxPooling and then upsampled using various upsampling techniques. The resulting artifacts are compared on zoomed-in red box regions for better visibility. Bilinear interpolation causes over-smoothing. Bicubic interpolation causes overestimation along image boundaries, while Pixel Shuffle and Nearest Neighbor cause strong grid artifacts along with discoloration. Small-kernel transposed convolutions cause grid artifacts; as the kernel size increases, the upsampling quality improves.
  • Figure 4: Abstract representation of an encoder-decoder architecture. While the implementation of the encoder varies across tasks (including transformer-based encoders), our study focuses on the decoder (in green). The decoder backbone is commonly a ResNet-like structure for feature extraction [unet, segnet]; we additionally use a ConvNeXt-like [convnext] structure. We investigate variants of the upsampling operation (the operations along the red arrows in the decoder) for fixed decoder blocks. As a probe for H\ref{hyp:first}, we consider the baseline transposed convolution (a, top right), LCTC with an increased convolution kernel size (b, top right), and LCTC with an increased convolution kernel plus a second path using a small convolution kernel (c, top right). To test whether the plain increase in parameters is responsible for the improved results (null hypothesis, H\ref{hyp:second}), we also ablate the convolution kernel size in the decoder block (operations along the blue arrows in the green block), as shown on the bottom right. We consider the common ResNet-like decoder building block (d) and two ConvNeXt-like decoder building blocks (e and f), where f has an additional small convolution applied in parallel, analogous to c.
  • Figure 5: NAFNet, as proposed, uses Pixel Shuffle for upsampling. For comparison, we modify only the upsampling operations, replacing them with transposed convolutions with a 3$\times$3 kernel or with LCTC (Ours). The results, for example under a 10-step PGD attack with $\epsilon\approx\frac{8}{255}$, support our proposed H\ref{hyp:first}. More examples for [chen2022simple, zamir2022restormer] using different attacks and budgets are in \ref{subsec:appendix:image_restoration:visual_results}.
  • ...and 9 more figures
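
The captions for Figures 1 and 2 describe upsampling artifacts as repeated spectral peaks and note that larger kernels can approximate larger $\mathrm{sinc}$-like structures. The sketch below illustrates that idea in 1D under stated assumptions: it upsamples a single cosine by a factor of 2 (zero insertion followed by filtering, which is what a stride-2 transposed convolution computes) and measures how strongly the unwanted spectral replica survives relative to the true frequency. The Hann-windowed $\mathrm{sinc}$ kernels are fixed, hand-designed stand-ins chosen for illustration; in the paper the filter weights are learned. The helper names, the periodic boundary handling, and the test frequency are likewise assumptions.

```python
import cmath
import math


def windowed_sinc_kernel(half_width):
    """Hann-windowed sinc for x2 upsampling with 2 * half_width + 1 taps."""
    taps = []
    for n in range(-half_width, half_width + 1):
        sinc = 1.0 if n == 0 else math.sin(math.pi * n / 2) / (math.pi * n / 2)
        hann = 0.5 * (1.0 + math.cos(math.pi * n / (half_width + 1)))
        taps.append(sinc * hann)
    return taps


def upsample_x2(x, kernel):
    """Zero insertion plus filtering, with periodic boundaries for simplicity."""
    n_out = 2 * len(x)
    center = (len(kernel) - 1) // 2
    y = [0.0] * n_out
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            y[(2 * i + j - center) % n_out] += xi * kj
    return y


def dft_magnitude(y, k):
    """Magnitude of DFT bin k of the real sequence y."""
    n = len(y)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * m / n)
                   for m, v in enumerate(y)))


def replica_ratio(kernel, n=32, k0=10):
    """Replica-to-signal magnitude ratio after x2 upsampling of a cosine.

    The input has a peak at bin k0; zero insertion mirrors it to bin n - k0
    of the length-2n spectrum. An ideal interpolator would remove that peak.
    """
    x = [math.cos(2 * math.pi * k0 * i / n) for i in range(n)]
    y = upsample_x2(x, kernel)
    return dft_magnitude(y, n - k0) / dft_magnitude(y, k0)


if __name__ == "__main__":
    kernels = {
        "nearest neighbor (2 taps)": [1.0, 1.0],
        "linear (3 taps)": [0.5, 1.0, 0.5],
        "windowed sinc (7 taps)": windowed_sinc_kernel(3),
        "windowed sinc (11 taps)": windowed_sinc_kernel(5),
    }
    for name, kernel in kernels.items():
        print(f"{name}: replica/signal = {replica_ratio(kernel):.3f}")
```

For this particular test frequency, the replica shrinks as the kernel grows. The ordering at a single frequency is not guaranteed for every choice of `k0`, so a more careful comparison would look at the worst case over a band of frequencies.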