Understanding when spatial transformer networks do not support invariance, and what to do about it

Lukas Finnveden, Ylva Jansson, Tony Lindeberg

TL;DR

The paper proves that spatial transformer networks (STNs) cannot, in general, support invariance by transforming CNN feature maps. A purely spatial transformation cannot align the feature maps of a transformed image with those of its original: under rotation, different feature channels may respond, and under rescaling, the receptive fields cover a different part of the object. The paper compares STN architectures and shows that transforming the input image outperforms transforming feature maps for rotation and scale. Deeper localization networks are hard to train on their own, but localization networks that share parameters with the classification network remain stable as they grow deeper and reach higher accuracy. Iterative alignment provides additional gains but is not a substitute for rich, deep features. The results guide STN design for practical invariance and robustness on datasets such as MNIST, SVHN, and PlanktonSet, with implications for related methods based on spatial transformations.

Abstract

Spatial transformer networks (STNs) were designed to enable convolutional neural networks (CNNs) to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image with those of its original. STNs are therefore unable to support invariance when transforming CNN feature maps. We present a simple proof for this and study the practical implications, showing that this inability is coupled with decreased classification accuracy. We therefore investigate alternative STN architectures that make use of complex features. We find that while deeper localization networks are difficult to train, localization networks that share parameters with the classification network remain stable as they grow deeper, which allows for higher classification accuracy on difficult datasets. Finally, we explore the interaction between localization network complexity and iterative image alignment.
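
The central claim can be stated compactly. The following is only a sketch, not the paper's own formulation: $\Gamma$ (a feature extractor, i.e. some CNN layers) and $T_g$ (a spatial transformation with parameters $g$) follow the notation of the Figure 1 caption, $f$ is an assumed symbol for the input image, and $T_g$ applied to a feature map is taken to mean the same spatial warp applied to every channel.

```latex
\begin{align}
  % Alignment at the input: the inverse warp undoes the transformation exactly.
  \Gamma\bigl(T_g^{-1}\, T_g f\bigr) &= \Gamma(f)
  && \text{(holds for every invertible } T_g\text{)} \\
  % Alignment at a feature map: what an ST inside the network would need.
  T_g^{-1}\, \Gamma(T_g f) &= \Gamma(f)
  && \text{(required for all } f\text{)} \\
  % The second condition is equivalent to covariance of the feature extractor.
  \Longleftrightarrow \quad \Gamma(T_g f) &= T_g\, \Gamma(f)
\end{align}
```

The last line says that feature-map alignment works only when $\Gamma$ commutes with $T_g$. Convolutional layers have this property for translations (up to boundary and stride effects) but not, in general, for rotations or rescalings, which is where the argument in the abstract applies.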

Paper Structure

This paper contains 33 sections, 17 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: A spatial transformation of a CNN feature map cannot, in general, align the feature maps of a transformed image with those of its original. Here, the network $\Gamma$ has two feature channels "W" and "M", and $T_g$ corresponds to a 180$^\circ$ rotation. Since different feature channels respond to the rotated image than to the original image, it is not possible to align the respective feature maps by applying the inverse spatial rotation to the feature maps. This implies that spatially transforming feature maps cannot enable invariant recognition by means of aligning a set of feature maps to a common pose.
  • Figure 2: For any transformation that includes a scaling component, the field of view of a feature extractor with respect to an object will differ between an original and a rescaled image. Consider a simple linear model that performs template matching with a single filter. When applied to the original image, the filter matches the size of the object that it has been trained to recognize and thus responds strongly. When applied to a rescaled image, the filter never covers the full object of interest. Thus, the response cannot be guaranteed to take even the same set of values for a rescaled image and its original.
  • Figure 3: Depiction of four different ways to build STNs. LOC denotes the localization network, which predicts the parameters of a transformation. ST denotes the spatial transformer, which takes these parameters and transforms an image or feature map according to them. In STN-C0, the ST transforms the input image. In STN-CX, the ST transforms a feature map, which prevents proper invariance. STN-DLX transforms the input image, but makes use of deeper features by including copies of the first X convolutional layers in the localization network. This is not fundamentally different from STN-C0 but acts as a useful comparison point. STN-SLX is similar to STN-DLX, but shares parameters between the classification and localization networks.
  • Figure 4: Depiction of how an STN transforming CNN feature maps at different depths can be transformed into an iterative STN with shared layers. STN-C0123 transforms feature maps by placing STs at multiple depths (Jaderberg et al., 2015). STN-SL0123 instead iteratively transforms the input image and, in addition, shares parameters between the localization networks and the classification network. The image is fed multiple times through the first layers of the network, each time producing an update to the transformation parameters. Thus, the transformation is, similarly to STN-C0123, iteratively fine-tuned based on more and more complex features but, at the same time, the ability to support invariant recognition is preserved.
  • Figure 5: Illustration of how STN-C1 and STN-SL1 compensate for different perturbations. The top row shows three digits rotated (first image), translated (second image), or scaled (third image) in three different ways. The middle row and bottom row show how STN-C1 and STN-SL1 transform the digits in the top row. STN-C1 does not compensate for rotations at all, but it successfully localizes and zooms in on translated digits. It only compensates somewhat for scaling. STN-SL1 finds a canonical pose for all perturbations. Note that STN-C1 does not transform the input image, so the middle row is just an illustration of the transformation parameters that are normally used to transform its CNN feature map.
  • ...and 2 more figures
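
The situation in Figure 1 can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: the "network" is a single layer with two random 3x3 filters, where the second filter is a 180° rotated copy of the first (standing in for the "W" and "M" channels), and all names (`correlate_valid`, `rot180`, `features`) are hypothetical helpers written for this example.

```python
import numpy as np

def correlate_valid(image, kernel):
    """2-D cross-correlation without padding (one CNN channel, no bias)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def rot180(a):
    """Rotate the last two (spatial) axes by 180 degrees."""
    return a[..., ::-1, ::-1]

def features(image, kernels):
    """Stack one feature map per kernel: shape (channels, height, width)."""
    return np.stack([correlate_valid(image, k) for k in kernels])

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))

# Assumed toy filters: channel 1 is channel 0 rotated by 180 degrees,
# mimicking the "W" and "M" detectors of Figure 1.
k_w = rng.normal(size=(3, 3))
k_m = rot180(k_w)
kernels = [k_w, k_m]

original = features(image, kernels)
rotated = features(rot180(image), kernels)

# Feature-map alignment: apply the inverse spatial rotation to each channel.
realigned = rot180(rotated)
aligned_by_spatial_transform = np.allclose(realigned, original)

# The spatial layout now matches, but the responses sit in swapped channels.
aligned_after_channel_swap = np.allclose(realigned[::-1], original)

# Input alignment: undo the rotation before feature extraction.
aligned_by_input_transform = np.allclose(
    features(rot180(rot180(image)), kernels), original
)

print(aligned_by_spatial_transform)  # False
print(aligned_after_channel_swap)    # True
print(aligned_by_input_transform)    # True
```

In this toy setting the mismatch is a pure channel permutation, which a spatial transformer has no way to express; for filters without such a paired structure, no simple correction of this kind is available at all. Undoing the rotation at the input sidesteps the problem.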