The problems with using STNs to align CNN feature maps

Lukas Finnveden, Ylva Jansson, Tony Lindeberg

TL;DR

STNs that transform CNN feature maps cannot, in the general case, align the feature maps of a transformed image with those of its original. A theoretical argument for this is presented and the practical implications are investigated, showing that this inability is coupled with decreased classification accuracy.

Abstract

Spatial transformer networks (STNs) were designed to enable CNNs to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image and its original. We present a theoretical argument for this and investigate the practical implications, showing that this inability is coupled with decreased classification accuracy. We advocate taking advantage of more complex features in deeper layers by instead sharing parameters between the classification and the localisation network.
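
The core of the argument can be sketched for a single convolutional layer. This is a simplified illustration, not the paper's own equation: $\Gamma$ and $T_g$ follow the notation of the Figure 1 caption, while the image $f$, the filter $\lambda$, the subscript in $\Gamma_\lambda$, the definition $\Gamma_\lambda(f)(x) = \sum_y f(x+y)\,\lambda(y)$ and the restriction to a linear $g$ (such as a rotation) acting as $(T_g f)(x) = f(g^{-1}x)$ on a domain that $g$ maps onto itself are all assumptions made here.

$$
\Gamma_\lambda(T_g f)(x)
= \sum_y f\big(g^{-1}(x+y)\big)\,\lambda(y)
= \sum_{y'} f\big(g^{-1}x + y'\big)\,\lambda(g y')
= \big(T_g\,\Gamma_{T_g^{-1}\lambda}(f)\big)(x)
$$

Under these assumptions, applying the inverse spatial transformation to the feature map gives $T_g^{-1}\,\Gamma_\lambda(T_g f) = \Gamma_{T_g^{-1}\lambda}(f)$. This equals $\Gamma_\lambda(f)$ only when the filter is itself invariant under the transformation, $T_g^{-1}\lambda = \lambda$, which learned filters generally are not.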

Paper Structure

This paper contains 4 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Inversely transforming the feature map will, in general, not align the feature maps of a transformed image and those of its original. The network $\Gamma$ has two feature channels "W" and "M". $T_g$ corresponds to a 180$^\circ$ rotation.
  • Figure 2: Visualisation of image/feature map alignment for rotated and translated MNIST images (top rows). STN-C1 fails to compensate for rotations but performs better for translations (middle rows). STN-SL1 finds a canonical pose both for rotated and translated images (bottom rows).
  • Figure 3: The rotation angle predicted by the ST module for MNIST images as a function of the rotation applied to the input image. STN-C1 has not learned to predict the image orientation (left): a rotation of the feature map does not correspond to a rotation of the input, so a rotation alone is not enough to align deeper-layer feature maps. STN-SL1, which transforms the input, correctly predicts the image orientation (right).
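
The situation described in the Figure 1 caption can be reproduced numerically. The following is a minimal NumPy sketch, not the authors' code: a single linear convolution layer stands in for $\Gamma$, the image and the "W" filter are random, and the "M" filter is assumed to be the "W" filter rotated by 180$^\circ$ (an illustrative choice suggested by the channel names). All function and variable names are hypothetical.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Plain 'valid' cross-correlation, the operation a CNN conv layer computes."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def rot180(a):
    """Rotate the last two (spatial) axes by 180 degrees; plays the role of T_g."""
    return a[..., ::-1, ::-1]


def feature_maps(image, filters):
    """Toy stand-in for the network Gamma: one conv layer, one channel per filter."""
    return np.stack([correlate2d_valid(image, k) for k in filters])


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
w_filter = rng.standard_normal((3, 3))  # channel "W" (random, for illustration)
m_filter = rot180(w_filter)             # channel "M": assumed to be "W" upside down
filters = [w_filter, m_filter]

original = feature_maps(image, filters)
transformed = feature_maps(rot180(image), filters)

# A purely spatial inverse transformation, which is all an ST module can apply.
back_rotated = rot180(transformed)
aligned_spatial_only = np.allclose(back_rotated, original)

# Alignment here additionally requires swapping the two channels.
aligned_with_channel_swap = np.allclose(back_rotated[::-1], original)

print("spatial inverse alone aligns the maps:", aligned_spatial_only)
print("spatial inverse plus channel swap aligns the maps:", aligned_with_channel_swap)
```

In this toy setting the rotated-back feature maps match the original ones only after the "W" and "M" channels are exchanged, an operation that a purely spatial transformation cannot express. For filters with no such symmetry between channels, no channel permutation would be expected to restore alignment either.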