Inability of spatial transformations of CNN feature maps to support invariant recognition
Ylva Jansson, Maksim Maydanskiy, Lukas Finnveden, Tony Lindeberg
TL;DR
The paper investigates whether purely spatial transformations of CNN feature maps can make recognition invariant to affine transformations of the input. It shows that aligning feature maps after feature extraction is generally impossible unless the network's features are already invariant to the transformation. The authors develop a rigorous, elementary analysis for both single- and multi-layer, translation-covariant, semi-local CNNs, introducing a generator framework and the notion of semi-locality to prove that the only viable post-transform alignment is the inverse spatial transform ${\cal{T}}_h^{-1}$, which in turn requires the features themselves to be invariant. Consequently, invariance to general affine transformations, including scaling and shear, cannot be achieved through spatial transformations of feature maps alone; rotation/reflection invariance is possible only if the features are themselves rotation/reflection invariant. The results have direct implications for spatial transformer networks and related methods: such approaches cannot replace alignment of the input for general affine invariance, and invariance should instead be built into the features themselves. Overall, the work clarifies fundamental limits of post-hoc spatial transformations in CNNs and guides the design of invariant representations.
Abstract
A large number of deep learning architectures use spatial transformations of CNN feature maps or filters to better deal with variability in object appearance caused by natural image transformations. In this paper, we prove that spatial transformations of CNN feature maps cannot align the feature maps of a transformed image to match those of its original, for general affine transformations, unless the extracted features are themselves invariant. Our proof is based on elementary analysis for both the single- and multi-layer network case. The results imply that methods based on spatial transformations of CNN feature maps or filters cannot replace image alignment of the input and cannot enable invariant recognition for general affine transformations, in particular not for scaling or shear transformations. For rotations and reflections, spatially transforming feature maps or filters can enable invariance, but only for networks with learnt or hardcoded rotation- or reflection-invariant features.
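The rotation case can be checked numerically. The sketch below is an illustration, not the paper's construction. It assumes a single correlation layer with periodic boundaries, a 90-degree rotation, and two hand-picked 3x3 filters; the helper names `circular_correlate` and `aligned_after_inverse_rotation` are hypothetical. It tests only whether the inverse rotation realigns the feature map of the rotated image with that of the original.

```python
import numpy as np

def circular_correlate(image, kernel):
    """Cross-correlate a 2-D image with a small odd-sized kernel, wrapping at the borders."""
    kh, kw = kernel.shape
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # np.roll with shift s gives rolled[y, x] = image[y - s_y, x - s_x].
            shift = (kh // 2 - i, kw // 2 - j)
            out += kernel[i, j] * np.roll(image, shift, axis=(0, 1))
    return out

def aligned_after_inverse_rotation(image, kernel):
    """Check whether rotating the feature map back reproduces the original feature map."""
    features = circular_correlate(image, kernel)
    features_of_rotated = circular_correlate(np.rot90(image), kernel)
    realigned = np.rot90(features_of_rotated, k=-1)  # inverse of the 90-degree rotation
    return bool(np.allclose(realigned, features))

# Assumed test input: a random integer-valued 16x16 image.
rng = np.random.default_rng(0)
image = rng.integers(0, 10, size=(16, 16)).astype(float)

# The discrete Laplacian is unchanged by 90-degree rotations; the x-derivative is not.
laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
x_derivative = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]], dtype=float)

laplacian_aligned = aligned_after_inverse_rotation(image, laplacian)
derivative_aligned = aligned_after_inverse_rotation(image, x_derivative)
print("Laplacian aligned:", laplacian_aligned)
print("x-derivative aligned:", derivative_aligned)
```

Under these assumptions, the Laplacian feature map realigns exactly, while the x-derivative feature map does not: rotating it back yields the y-derivative of the original image instead. This matches the abstract's statement for rotations in a toy setting only; it does not demonstrate the general result for scaling or shear.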
