Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling

Sourajit Saha, Tejas Gokhale

TL;DR

Downsampling in convolutional networks breaks shift invariance, undermining robustness to pixel-level shifts. The authors introduce Translation Invariant Polyphase Sampling (TIPS), a learnable pooling layer that uses polyphase decomposition and trainable mixing coefficients to reduce maximum-sampling bias (MSB), thereby enhancing shift invariance across classification, segmentation, and detection. Two regularizations are proposed: $\mathcal{L}_{FM}$, which discourages skewed or uniform mixing, and $\mathcal{L}_{undo}$, which trains the network to undo shifts. The model is optimized end-to-end with marginal computational overhead. Extensive experiments show that TIPS yields state-of-the-art shift invariance and robustness on multiple benchmarks and architectures, outperforming data augmentation and contrastive methods. The work also provides a large-scale analysis of MSB and demonstrates the practical benefits of reduced MSB for real-world vision tasks.

Abstract

Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that TIPS results in consistent performance gains in terms of accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation compared to previous methods, and also leads to improvements in adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
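
The opening claim of the abstract, that downsampling breaks shift invariance, can be seen on a toy 1-D signal. The sketch below is not taken from the paper: it uses plain non-overlapping max pooling with stride 2 and circular shifts, and the helper names `max_pool_1d` and `circular_shift` are hypothetical, chosen only for this illustration.

```python
def max_pool_1d(signal, stride=2):
    """Non-overlapping max pooling: the window size equals the stride."""
    return [max(signal[i:i + stride])
            for i in range(0, len(signal) - stride + 1, stride)]


def circular_shift(signal, k):
    """Shift a list to the right by k positions with wrap-around."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


x = [0, 0, 1, 3, 0, 0, 2, 0]

pooled = max_pool_1d(x)                            # [0, 3, 0, 2]
pooled_shift1 = max_pool_1d(circular_shift(x, 1))  # [0, 1, 3, 2]
pooled_shift2 = max_pool_1d(circular_shift(x, 2))  # [2, 0, 3, 0]

print(pooled, pooled_shift1, pooled_shift2)
```

Shifting the input by the full stride (2) merely shifts the pooled output by one position, but a one-pixel shift changes the pooled values themselves: `[0, 1, 3, 2]` is not a shifted copy of `[0, 3, 0, 2]`. This is the kind of sensitivity to small shifts that the abstract refers to.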

Paper Structure

This paper contains 28 sections, 6 equations, 16 figures, and 21 tables.

Figures (16)

  • Figure 1: Translation-Invariant Polyphase Sampling (TIPS) is a pooling operator that improves shift invariance of CNNs. (a) An illustration of the improvements in semantic segmentation with TIPS; (b) Greater shift consistency of TIPS at higher degrees of pixel shift; (c) TIPS results in consistent and architecture-agnostic improvements in accuracy and four measures of shift invariance for image classification and semantic segmentation.
  • Figure 2: TIPS downsamples a ReLU-activated intermediate feature map $X$ into $\hat{X}$ with stride $s$. The polyphase decomposition of $X$ yields components $\mathrm{poly}_{i}$, and a small fully convolutional function $f_{\theta}$ learns the polyphase mixing coefficients $\tau$; the output feature map $\hat{X}$ is the linear combination of the $\mathrm{poly}_{i}$ weighted by $\tau$.
  • Figure 3: The end-to-end training pipeline with TIPS, regularization to undo shift $\mathcal{L}_{undo}$, regularization to discourage known failure modes $\mathcal{L}_{FM}$, and downstream task loss $\mathcal{L}_{task}$.
  • Figure 4: Our large-scale correlation study shows a strong negative correlation of performance with MSB (%) as indicated by Pearson's $r$. Linear clusters with negative correlation are also observed for points belonging to each pooling method.
  • Figure 5: Qualitative comparison of segmentation masks predicted on original and shifted images. Regions where TIPS achieves improvements (i.e., consistent segmentation quality) under linear shifts are highlighted with circles.
  • ...and 11 more figures
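
The operation described in the Figure 2 caption (polyphase decomposition of $X$ followed by a $\tau$-weighted linear combination) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not specify the architecture of $f_{\theta}$, so `mixing_coefficients` is a hypothetical stand-in (global average pooling, a linear map, and a softmax), and the function names, the `weights` parameter, and the softmax normalization of $\tau$ are all assumptions made for this example.

```python
import numpy as np


def polyphase_components(x, stride):
    """Split x of shape (C, H, W) into stride**2 polyphase components,
    each of shape (C, H // stride, W // stride)."""
    _, h, w = x.shape
    assert h % stride == 0 and w % stride == 0, "H and W must be divisible by stride"
    return np.stack([x[:, a::stride, b::stride]
                     for a in range(stride) for b in range(stride)])


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def mixing_coefficients(x, weights):
    """Hypothetical stand-in for f_theta (an assumption, not the paper's design):
    global-average-pool x over space, apply a linear map, then a softmax."""
    pooled = x.mean(axis=(1, 2))        # shape (C,)
    return softmax(weights @ pooled)    # shape (stride**2,)


def tips_like_pool(x, weights, stride=2):
    """Downsample x as the tau-weighted linear combination of its polyphase components."""
    polys = polyphase_components(x, stride)    # (stride**2, C, H/stride, W/stride)
    tau = mixing_coefficients(x, weights)      # (stride**2,)
    x_hat = np.tensordot(tau, polys, axes=1)   # (C, H/stride, W/stride)
    return x_hat, tau


rng = np.random.default_rng(0)
x = np.maximum(rng.standard_normal((3, 8, 8)), 0.0)   # ReLU-activated feature map
weights = rng.standard_normal((4, 3))                  # stride**2 rows, C columns
x_hat, tau = tips_like_pool(x, weights, stride=2)
print(x_hat.shape, tau.sum())
```

With stride 2 the input is split into four components, one per sampling offset, and the output has half the spatial resolution. In a trained network the coefficients would come from the learned $f_{\theta}$ rather than from random `weights`.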