Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations

Hagay Michaeli, Tomer Michaeli, Daniel Soudry

TL;DR

This work addresses the overlooked problem that standard CNNs are not truly shift-invariant due to aliasing from downsampling and nonlinearities. It introduces Alias-Free ConvNets (AFC) that couple polynomial activations with an upsample–low-pass–downsample pipeline and alias-free normalization to guarantee shift-invariance for fractional translations and shift-equivariance of internal representations. Empirically, AFC achieves 100% shift consistency for integer and fractional shifts, shows certified robustness to translation-based adversarial attacks, and maintains competitive ImageNet performance, outperforming prior methods like APS and BlurPool under translation perturbations. The approach has practical implications for robust vision systems and can be extended to other domains and tasks such as segmentation, with opportunities to optimize computational efficiency.

Abstract

Although CNNs are believed to be invariant to translations, recent works have shown that this is not the case, due to aliasing effects that stem from downsampling layers. Existing architectural solutions for preventing aliasing are only partial, since they do not address the aliasing that originates in non-linearities. We propose an extended anti-aliasing method that tackles both downsampling and non-linear layers, thus creating truly alias-free, shift-invariant CNNs. We show that the presented model is invariant to integer as well as fractional (i.e., sub-pixel) translations, and thus outperforms other shift-invariant methods in terms of robustness to adversarial translations.

Paper Structure (50 sections, 2 theorems, 86 equations, 12 figures, 4 tables, 2 algorithms)

This paper contains 50 sections, 2 theorems, 86 equations, 12 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

In a network comprised of a Feature Extractor and a Classifier, if the Feature Extractor ends with a global average pooling layer, then shift-equivariance w.r.t. the continuous domain of the Feature Extractor implies shift-invariance w.r.t. the continuous domain of the entire model.
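
The intuition behind Proposition 1 can be checked numerically on a toy 1D example. The sketch below is not the paper's code: the helper names (`fractional_shift`, `feature_extractor`, `classifier`), the blur kernel, and the test signal are all hypothetical, and the fractional shift is assumed to be a circular translation implemented as a Fourier phase ramp on an odd-length signal. Global average pooling reads only the zero-frequency component, which a translation leaves untouched, so a shift-equivariant extractor followed by pooling gives a shift-invariant output.

```python
import cmath

def dft(x):
    """Plain O(n^2) discrete Fourier transform (enough for a toy signal)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fractional_shift(x, delta):
    """Circularly translate x by `delta` samples (possibly non-integer).

    Assumption: len(x) is odd, so there is no ambiguous Nyquist bin.
    """
    n = len(x)
    X = dft(x)
    shifted = []
    for k in range(n):
        f = k if k <= n // 2 else k - n  # signed frequency index
        shifted.append(X[k] * cmath.exp(-2j * cmath.pi * f * delta / n))
    return [v.real for v in idft(shifted)]

def circular_conv(x, kernel):
    """Circular convolution: linear and shift-equivariant."""
    n = len(x)
    return [sum(kernel[j] * x[(t - j) % n] for j in range(len(kernel)))
            for t in range(n)]

def feature_extractor(x):
    """Hypothetical extractor: one convolution, then global average pooling."""
    feats = circular_conv(x, [0.25, 0.5, 0.25])
    return sum(feats) / len(feats)

def classifier(pooled):
    """Hypothetical classifier head acting on the pooled feature."""
    return 3.0 * pooled - 1.0

def model(x):
    return classifier(feature_extractor(x))

# Arbitrary odd-length test signal (11 samples).
signal = [0.3, -1.2, 0.8, 2.0, -0.5, 1.1, 0.0, -0.7, 0.4, 1.5, -1.0]

for delta in (1.0, 0.5, 2.37):
    diff = abs(model(fractional_shift(signal, delta)) - model(signal))
    print(f"shift={delta}: |output difference| = {diff:.2e}")
```

The extractor here is purely linear, so it only illustrates the pooling argument; the harder part, keeping non-linear layers equivariant, is what the polynomial activation pipeline addresses.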

Figures (12)

  • Figure 1: ConvNeXt baseline architecture vs. AFC modifications. D-conv: $7 \times 7$ depthwise convolution, P-Conv: pointwise convolution, Strided-conv: $4 \times 4$ convolution with stride $4$. LN: Layer Norm, AF-LN: Alias-free Layer Norm, Poly: Polynomial activation. Up x2: Upsample x2, LPF: ideal LPF with cutoff 0.5, Down x2: Downsample x2. Detailed explanations of the BlurPool, Poly and LPF-Poly activations can be found in the paper's implementation section.
  • Figure 2: A demonstration of the proposed non-linearities in the frequency domain. In each panel, the top plot shows the signal in the continuous domain and the bottom plot shows the discrete domain. Upsampling the input (a) shrinks its frequency response, which expands the range of allowed frequencies (b). Applying the polynomial activation expands the support of the frequency response by a factor of $d$, without causing aliasing in the relevant frequencies (c1). Thus, the discrete signal remains a faithful representation of the continuous signal after applying the LPF (d1) and downsampling back to the original spatial size (e1). In contrast, applying GeLU expands the support infinitely (c2). This leads to aliasing, i.e., interference in the relevant frequencies, marked in red in (c2). As a result, the discrete signal is no longer a correct representation of the continuous one after the LPF (d2) and downsampling (e2).
  • Figure 3: Shift-equivariance measure w.r.t. the continuous signal. The averaged difference (the paper's equation labeled eq:diff) for inputs translated by $1/2$ pixel (y-axis), across all layers (x-axis). This experiment was run on 64 random samples from the validation set. While the AFC model has practically zero difference, the baseline and APS models show a difference of at least 50% across all layers.
  • Figure 4: Adversarial accuracy with image corruptions. Left: ImageNet-C accuracy (solid) vs. adversarial fractional grid accuracy (transparent). Right: Accuracy vs. adversarial accuracy difference (percentage). The ImageNet-C accuracy of ConvNeXt-AFC (ours) is not affected by translations, whereas for ConvNeXt-APS and ConvNeXt-Baseline the relative accuracy degradation caused by translations increases with the corruption severity.
  • Figure 5: Visualization of shift attacks, similar to the framework of Engstrom et al. (2017). (a) The original image is zero-padded by 8 pixels in each direction. The attack is a translation of up to 8 pixels in each direction; e.g., (b) shows a translation of $6$ and $-2.5$ pixels along the $x$ and $y$ axes, respectively. Sub-pixel translations are performed using bilinear interpolation.
  • ...and 7 more figures
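
The mechanism described in the Figure 2 caption can be reproduced on a toy 1D signal. This is a minimal sketch under stated assumptions, not the authors' implementation: the signal is circular with odd length, upsampling and the ideal low-pass filter are done by zero-padding and truncating the spectrum, and the degree-2 polynomial coefficients and all function names are hypothetical. It compares a polynomial applied directly on the original grid with the same polynomial wrapped in upsample x2, LPF, downsample x2, and measures how far each is from commuting with a half-sample shift.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def signed(k, n):
    """Signed frequency index of DFT bin k for a length-n signal."""
    return k if k <= (n - 1) // 2 else k - n

def fractional_shift(x, delta):
    """Circular translation by `delta` samples via a Fourier phase ramp (odd length)."""
    n = len(x)
    X = dft(x)
    Y = [X[k] * cmath.exp(-2j * cmath.pi * signed(k, n) * delta / n) for k in range(n)]
    return [v.real for v in idft(Y)]

def upsample2(x):
    """Band-limited x2 upsampling: zero-pad the spectrum to twice the length."""
    n = len(x)
    X = dft(x)
    Y = [0j] * (2 * n)
    for k in range(n):
        Y[signed(k, n) % (2 * n)] = 2 * X[k]
    return [v.real for v in idft(Y)]

def lowpass_downsample2(y):
    """Ideal LPF at half the band, then x2 downsampling (spectrum truncation)."""
    m = len(y)
    n = m // 2
    Y = dft(y)
    X = [Y[signed(k, n) % m] / 2 for k in range(n)]
    return [v.real for v in idft(X)]

def poly(values):
    """Hypothetical degree-2 polynomial activation, applied pointwise."""
    return [0.5 * v + 0.25 * v * v for v in values]

def naive_act(x):
    return poly(x)  # applied directly on the original grid

def alias_free_act(x):
    return lowpass_downsample2(poly(upsample2(x)))

def equivariance_error(act, x, delta):
    """Max absolute gap between act(shift(x)) and shift(act(x))."""
    a = act(fractional_shift(x, delta))
    b = fractional_shift(act(x), delta)
    return max(abs(p - q) for p, q in zip(a, b))

n = 15
x = [math.sin(2 * math.pi * 6 * t / n) + 0.5 * math.cos(2 * math.pi * 2 * t / n)
     for t in range(n)]

for delta in (1.0, 0.5):
    print(f"shift={delta}: naive={equivariance_error(naive_act, x, delta):.2e}, "
          f"alias-free={equivariance_error(alias_free_act, x, delta):.2e}")
```

Squaring doubles the bandwidth, so a degree-2 polynomial needs only x2 upsampling to stay below the Nyquist limit of the finer grid. The naive version commutes with integer shifts but not with the half-sample shift, because the doubled frequencies fold back onto the original grid.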

Theorems & Definitions (8)

  • Definition 1: Fractional translation for discrete signals
  • Definition 2: Shift-equivariance w.r.t. the continuous domain
  • Definition 3: Shift-invariance w.r.t. the continuous domain
  • Proposition 1
  • Proposition 2
  • Claim 1
  • Claim 2
  • Claim 3
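
Only the titles of these definitions are listed here, so the following is a hedged reading of what they typically formalize, not the paper's exact statements. The notation ($x_c$, $\tau_\Delta$, $f$) is assumed for illustration: a discrete signal is identified with its band-limited continuous interpolation, a fractional translation resamples the shifted continuous signal, and equivariance and invariance are then required for every real shift $\Delta$.

```latex
% Assumed notation (not taken from the paper):
%   x[n]        discrete signal
%   x_c(t)      its band-limited (sinc) interpolation
%   \tau_\Delta translation by \Delta \in \mathbb{R} pixels
%   f           a layer or model that preserves the sampling grid
\begin{align}
  x_c(t) &= \sum_{n} x[n]\,\operatorname{sinc}(t - n), \\
  (\tau_\Delta x)[n] &= x_c(n - \Delta), \\
  f(\tau_\Delta x) &= \tau_\Delta f(x) \quad \forall \Delta \in \mathbb{R}
    && \text{(shift-equivariance)}, \\
  f(\tau_\Delta x) &= f(x) \quad \forall \Delta \in \mathbb{R}
    && \text{(shift-invariance)}.
\end{align}
```

For integer $\Delta$ the second line reduces to the usual index shift $x[n - \Delta]$, so these conditions extend the standard discrete notions to sub-pixel shifts.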