Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling
Sourajit Saha, Tejas Gokhale
TL;DR
Convolutional downsampling breaks shift invariance, undermining robustness to pixel-level shifts. The authors introduce Translation Invariant Polyphase Sampling (TIPS), a learnable pooling layer that uses polyphase decomposition and trainable mixing coefficients to reduce maximum-sampling bias (MSB), thereby enhancing shift invariance across classification, segmentation, and detection. Two regularizations, L_{FM} and L_{undo}, are proposed to discourage skewed or uniform mixing and to enable undoing standard shifts during training, with end-to-end optimization and marginal overhead. Extensive experiments show that TIPS yields state-of-the-art shift invariance and robustness on multiple benchmarks and architectures, outperforming data augmentation and contrastive methods. The work also provides a large-scale analysis of MSB and demonstrates the practical benefits of reduced MSB for real-world vision tasks.
Abstract
Downsampling operators break the shift invariance of convolutional neural networks (CNNs), which affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that TIPS results in consistent performance gains in terms of accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation compared to previous methods, and also leads to improvements in adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
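To make the idea of polyphase sampling with mixing coefficients concrete, the following is a minimal sketch and not the authors' TIPS layer. It splits a 2D feature map into its stride-2 polyphase components and combines them with softmax weights. Several details are assumptions made purely for illustration: the weights come from a hand-picked scoring rule (mean squared activation of each component) rather than from trained parameters, shifts are circular, the map has even height and width, and the helper names (`polyphase_components`, `mix_components`, `circular_shift`, `global_mean`) are hypothetical.

```python
import math
import random


def polyphase_components(x, stride=2):
    """Split a 2D grid (list of rows) into stride*stride subsampled grids."""
    return [
        [row[j::stride] for row in x[i::stride]]
        for i in range(stride)
        for j in range(stride)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_components(x, stride=2):
    """Downsample x as a softmax-weighted sum of its polyphase components."""
    comps = polyphase_components(x, stride)
    h, w = len(comps[0]), len(comps[0][0])
    # Assumed scoring rule (illustration only): mean squared activation.
    # A learnable layer would produce these scores from trained parameters.
    energy = [sum(v * v for row in c for v in row) / (h * w) for c in comps]
    weights = softmax(energy)
    out = [
        [sum(wt * c[r][k] for wt, c in zip(weights, comps)) for k in range(w)]
        for r in range(h)
    ]
    return out, weights


def circular_shift(x, dy, dx):
    """Shift a 2D grid by (dy, dx) pixels with wrap-around."""
    h, w = len(x), len(x[0])
    return [[x[(r - dy) % h][(k - dx) % w] for k in range(w)] for r in range(h)]


def global_mean(x):
    """Global average pooling over a 2D grid."""
    return sum(v for row in x for v in row) / (len(x) * len(x[0]))


if __name__ == "__main__":
    random.seed(0)
    image = [[random.random() for _ in range(8)] for _ in range(8)]
    pooled, weights = mix_components(image)
    pooled_shifted, _ = mix_components(circular_shift(image, 1, 1))
    # Difference between the globally pooled outputs before and after the shift.
    print(abs(global_mean(pooled) - global_mean(pooled_shifted)))
```

A one-pixel circular shift only permutes the polyphase components (and rolls some of them), and the assumed scoring rule is unaffected by rolls. The globally averaged output therefore agrees before and after the shift up to floating-point error. A fixed-grid strided subsampling would instead read a different component after the shift.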
