Inability of spatial transformations of CNN feature maps to support invariant recognition

Ylva Jansson, Maksim Maydanskiy, Lukas Finnveden, Tony Lindeberg

TL;DR

The paper investigates whether purely spatial transformations of CNN feature maps can make recognition invariant to affine transformations, and shows that alignment after feature extraction is generally impossible unless the network's features are already invariant to the transformation. It develops a rigorous, elementary analysis for both single- and multi-layer, translation-covariant, semi-local CNNs, introducing a generator framework and the notion of semi-locality to prove that the only viable alignment after feature extraction is the inverse spatial transform ${\cal{T}}_h^{-1}$, which in turn requires the features themselves to be invariant. Consequently, invariance to general affine, scaling, or shear transformations cannot be achieved through spatial transformations of feature maps alone; rotation or reflection invariance is possible only if the features are themselves rotation or reflection invariant. The results have direct implications for spatial transformer networks and related methods: such approaches cannot replace alignment of the input for general affine invariance, and invariance must instead be built into the features themselves. Overall, the work clarifies the fundamental limits of spatially transforming feature maps after extraction and guides the design of invariant representations.

Abstract

A large number of deep learning architectures use spatial transformations of CNN feature maps or filters to better deal with variability in object appearance caused by natural image transformations. In this paper, we prove that spatial transformations of CNN feature maps cannot align the feature maps of a transformed image to match those of its original, for general affine transformations, unless the extracted features are themselves invariant. Our proof is based on elementary analysis for both the single- and multi-layer network case. The results imply that methods based on spatial transformations of CNN feature maps or filters cannot replace image alignment of the input and cannot enable invariant recognition for general affine transformations, specifically not for scaling transformations or shear transformations. For rotations and reflections, spatially transforming feature maps or filters can enable invariance but only for networks with learnt or hardcoded rotation- or reflection-invariant features.
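
The scaling claim in the abstract can be illustrated with a small numerical sketch. This is not the paper's proof, only a toy 1-D example under stated assumptions: the `template` filter, the `signal`, and the use of `np.repeat` as a stand-in for a continuous rescaling by a factor of 2 are all hypothetical choices made here. The idea is that a purely spatial transformation of a feature map only relocates its values, so if the original and rescaled responses do not even contain the same set of values, no spatial transformation can align them.

```python
import numpy as np

# Hypothetical matched filter and a signal whose "object" equals the filter.
template = np.array([1.0, 2.0, 1.0])
signal = np.array([0, 0, 0, 1, 2, 1, 0, 0, 0], dtype=float)

# Stand-in for a scaling transformation: stretch the signal by a factor of 2.
rescaled = np.repeat(signal, 2)

# Single-layer, translation-covariant feature extractor: cross-correlation.
resp_original = np.correlate(signal, template, mode="valid")
resp_rescaled = np.correlate(rescaled, template, mode="valid")

# A spatial transformation of a feature map can only move values around,
# so matching value sets are a necessary condition for alignment.
values_original = set(resp_original.tolist())
values_rescaled = set(resp_rescaled.tolist())
same_value_set = values_original == values_rescaled

print(sorted(values_original))
print(sorted(values_rescaled))
print("same set of response values:", same_value_set)
```

In this toy case the filter never covers the whole stretched object, and the two responses take different sets of values, so the necessary condition for alignment already fails.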

Paper Structure

This paper contains 27 sections, 53 equations, 3 figures.

Figures (3)

  • Figure 1: Commutative diagram for a covariant feature extractor $\Lambda$.
  • Figure 2: An inverse spatial transformation of a CNN feature map cannot, in general, align the feature maps of a transformed image with those of its original. Here, the network $\Lambda$ has two feature channels "W" and "M", and $T_g$ corresponds to a $180^\circ$ rotation. Since different feature channels respond to the rotated image as compared to the original image, it is not possible to align the respective feature maps with a spatial rotation. In fact, spatially transforming feature maps cannot, in most cases, eliminate differences related to object pose and thus cannot enable invariant recognition.
  • Figure 3: For any transformation that includes a scaling component, the field of view of a feature extractor with respect to an object will differ between an original and rescaled image. Consider e.g. a simple linear model that performs template matching with a single filter. When applied to the original image, the filter matches the size of the object that it has been trained to recognize and thus responds strongly. When applied to a rescaled image, the filter never covers the full object of interest, and thus the response cannot be guaranteed to take even the same set of values for a rescaled image and its original.
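
The situation described in the caption of Figure 2 can be reproduced numerically. The following is a minimal sketch, not the paper's construction: the random `image`, the filter `w`, and the helper names `correlate_valid`, `rot180`, and `features` are assumptions introduced here. The second filter `m` is defined as `w` rotated by $180^\circ$, mimicking the "W" and "M" channels. Rotating the feature maps of the rotated image back does not recover the original feature maps channel by channel; they match only once the two channels are also swapped, which is not a spatial operation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation (no padding, no kernel flip)."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)


def rot180(a):
    """Rotate the last two axes of an array by 180 degrees."""
    return np.rot90(a, 2, axes=(-2, -1))


def features(image, filters):
    """Single-layer feature extractor: one channel per filter."""
    return np.stack([correlate_valid(image, k) for k in filters])


rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
w = rng.standard_normal((3, 3))  # hypothetical "W" filter
m = rot180(w)                    # hypothetical "M" filter: "W" upside down
filters = [w, m]

original = features(image, filters)
rotated = features(rot180(image), filters)

# Apply the inverse spatial transformation to every feature channel.
realigned = rot180(rotated)

aligned_spatially = np.allclose(realigned, original)
aligned_with_channel_swap = np.allclose(realigned[::-1], original)

print("aligned by spatial rotation alone:", aligned_spatially)
print("aligned after also swapping channels:", aligned_with_channel_swap)
```

The channel swap is what a purely spatial transformation cannot provide. If both filters were themselves symmetric under a $180^\circ$ rotation, the swap would be unnecessary, which corresponds to the case of rotation-invariant features.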

Theorems & Definitions (22)

  • proof: Sketch
  • proof (9 further entries)
  • ...and 12 more