Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling

Sourajit Saha; Tejas Gokhale

TL;DR

Convolutional downsampling breaks shift invariance, undermining robustness to pixel-level shifts. The authors introduce Translation Invariant Polyphase Sampling (TIPS), a learnable pooling layer that uses polyphase decomposition and trainable mixing coefficients to reduce maximum-sampling bias (MSB), thereby enhancing shift invariance across classification, segmentation, and detection. Two regularizations, L_{FM} and L_{undo}, are proposed to discourage skewed or uniform mixing and to enable undoing standard shifts during training, with end-to-end optimization and marginal overhead. Extensive experiments show that TIPS yields state-of-the-art shift invariance and robustness on multiple benchmarks and architectures, outperforming data augmentation and contrastive methods. The work also provides a large-scale analysis of MSB and demonstrates the practical benefits of reduced MSB for real-world vision tasks.

Abstract

Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that, compared to previous methods, TIPS yields consistent gains in accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation, and also improves adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
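
The claim that downsampling breaks shift invariance can be seen in a toy case. The following is a minimal sketch, not taken from the paper: it assumes a 1D signal, plain stride-2 subsampling, and a circular one-sample shift (via `np.roll`), all chosen only for illustration.

```python
import numpy as np

def subsample(x, stride=2):
    """Keep every `stride`-th sample, starting at index 0."""
    return x[::stride]

# Hypothetical toy signal: all the energy sits on odd indices.
x = np.array([0.0, 5.0, 0.0, 1.0, 0.0, 2.0, 0.0, 3.0])
x_shifted = np.roll(x, 1)  # circular shift by one sample

y = subsample(x)                  # samples indices 0, 2, 4, 6 of x
y_shifted = subsample(x_shifted)  # same operator, shifted input

print(y)          # [0. 0. 0. 0.]
print(y_shifted)  # [3. 5. 1. 2.]
```

A one-sample shift changes the downsampled output completely here, so any feature computed from it (a sum, a maximum) changes as well. That sensitivity to the sampling phase is the kind of effect the abstract refers to.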

Paper Structure (28 sections, 6 equations, 16 figures, 21 tables)

Introduction
Related Work
Translation Invariant Polyphase Sampling
TIPS: A Learnable Pooling Layer
Training CNNs with TIPS
Maximum-Sampling Bias
Experiments
Image Classification Experiments
Semantic Segmentation Experiments
Object Detection Experiments
Analysis
Effect of L_undo and L_FM Regularization
Size of Models and Number of TIPS Layers
Performance at Different Levels of Shift
Robustness Evaluation
...and 13 more sections

Figures (16)

Figure 1: Translation-Invariant Polyphase Sampling (TIPS) is a pooling operator that improves shift invariance of CNNs. (a) An illustration of the improvements in semantic segmentation with TIPS; (b) Greater shift consistency of TIPS at higher degrees of pixel shift; (c) TIPS results in consistent and architecture-agnostic improvements in accuracy and four measures of shift invariance for image classification and semantic segmentation.
Figure 2: TIPS downsamples a ReLU-activated intermediate feature map $X$ into $\hat{X}$ with stride $s$. Polyphase decomposition of $X$ yields the components $\mathrm{poly}_{i}$; a small fully convolutional function $f_{\theta}$ learns the polyphase mixing coefficients $\tau$, and the output feature map $\hat{X}$ is the linear combination of the $\mathrm{poly}_{i}$ weighted by $\tau$.
Figure 3: The end-to-end training pipeline with TIPS, regularization to undo shift $\mathcal{L}_{undo}$, regularization to discourage known failure modes $\mathcal{L}_{FM}$, and downstream task loss $\mathcal{L}_{task}$.
Figure 4: Our large-scale correlation study shows a strong negative correlation of performance with MSB (%) as indicated by Pearson's $r$. Linear clusters with negative correlation are also observed for points belonging to each pooling method.
Figure 5: Qualitative comparison of segmentation masks predicted on original and shifted images. Regions where TIPS achieves improvements (i.e., consistent segmentation quality) under linear shifts are highlighted with circles.
...and 11 more figures
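
The mixing step described in the Figure 2 caption can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it handles a single-channel map whose sides are divisible by the stride, the function names are hypothetical, and the coefficients $\tau$ come from hand-picked logits passed through a softmax rather than from the learned function $f_{\theta}$.

```python
import numpy as np

def polyphase_components(x, stride=2):
    """Split an (H, W) map into stride**2 components of shape (H//stride, W//stride)."""
    return np.stack([x[i::stride, j::stride]
                     for i in range(stride) for j in range(stride)])

def softmax(z):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def mix_polyphase(x, logits, stride=2):
    """Weighted linear combination of polyphase components.

    `logits` stands in for the output of f_theta (assumption: in this
    sketch they are supplied by hand and normalised with a softmax).
    """
    poly = polyphase_components(x, stride)   # (stride**2, H//stride, W//stride)
    tau = softmax(np.asarray(logits, dtype=float))
    return np.tensordot(tau, poly, axes=1), tau

x = np.arange(16.0).reshape(4, 4)

# Uniform coefficients: every phase gets weight 0.25.
out_uniform, tau_uniform = mix_polyphase(x, [0.0, 0.0, 0.0, 0.0])
print(out_uniform)  # [[ 2.5  4.5] [10.5 12.5]]

# Heavily skewed coefficients: almost all weight on the first phase.
out_skewed, tau_skewed = mix_polyphase(x, [50.0, 0.0, 0.0, 0.0])
print(np.round(out_skewed, 3))  # [[ 0.  2.] [ 8. 10.]]
```

In this sketch, uniform coefficients reproduce 2x2 average pooling, while a heavily skewed $\tau$ collapses to strided subsampling of a single phase. These two extremes correspond to the skewed and uniform mixing that the TL;DR says $\mathcal{L}_{FM}$ is meant to discourage.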
