Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

TL;DR

CNNs struggle to achieve spatial invariance without costly depth or pooling. The Spatial Transformer is a differentiable, input-conditioned module (a localisation network, a parameterised sampling grid, and a differentiable sampler) that warps feature maps to a canonical pose or a region of interest and is trained end-to-end. Experiments show state-of-the-art performance on distorted MNIST and SVHN and improved fine-grained bird classification, with the network learning to crop, rotate, and align regions or parts without extra supervision. The module plugs into standard architectures to provide attention and pose normalization, with potential extensions to 3D and recurrent models.

Abstract

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
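
To make the mechanism concrete, here is a minimal NumPy sketch of the forward pass of the grid generator and sampler for a single-channel feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names (`affine_grid`, `bilinear_sample`, `spatial_transformer`) are hypothetical, the localisation network is omitted so the affine parameters `theta` are supplied by hand, coordinates are normalised to [-1, 1], and samples falling outside the input are treated as zero.

```python
import numpy as np


def affine_grid(theta, out_h, out_w):
    """Warp a regular output grid with a 2x3 affine matrix `theta`.

    Output (target) coordinates are normalised to [-1, 1]. Returns the
    source coordinates (xs, ys), each of shape (out_h, out_w).
    """
    xt, yt = np.meshgrid(np.linspace(-1.0, 1.0, out_w),
                         np.linspace(-1.0, 1.0, out_h))
    # Homogeneous target coordinates, one column per output pixel.
    target = np.stack([xt, yt, np.ones_like(xt)], axis=0).reshape(3, -1)
    source = np.asarray(theta, dtype=float) @ target  # shape (2, out_h*out_w)
    return (source[0].reshape(out_h, out_w),
            source[1].reshape(out_h, out_w))


def bilinear_sample(U, xs, ys):
    """Bilinearly sample the 2D map U at normalised coordinates (xs, ys)."""
    H, W = U.shape
    # Convert normalised coordinates to pixel coordinates.
    x = (xs + 1.0) * (W - 1) / 2.0
    y = (ys + 1.0) * (H - 1) / 2.0
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    # Interpolation weights of the four neighbouring pixels.
    wx1, wy1 = x - x0, y - y0
    wx0, wy0 = 1.0 - wx1, 1.0 - wy1

    def gather(yi, xi):
        # Assumption: out-of-range neighbours contribute zero.
        valid = (yi >= 0) & (yi < H) & (xi >= 0) & (xi < W)
        vals = U[np.clip(yi, 0, H - 1), np.clip(xi, 0, W - 1)]
        return np.where(valid, vals, 0.0)

    return (wy0 * wx0 * gather(y0, x0) + wy0 * wx1 * gather(y0, x1)
            + wy1 * wx0 * gather(y1, x0) + wy1 * wx1 * gather(y1, x1))


def spatial_transformer(U, theta, out_shape):
    """Grid generator followed by sampler; `theta` stands in for the
    output of a localisation network, which is not modelled here."""
    xs, ys = affine_grid(theta, *out_shape)
    return bilinear_sample(U, xs, ys)


if __name__ == "__main__":
    U = np.arange(16, dtype=float).reshape(4, 4)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    zoom = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]  # samples the central region
    print(spatial_transformer(U, identity, (4, 4)))
    print(spatial_transformer(U, zoom, (4, 4)))
```

With the identity parameters the output reproduces the input, while the scaled parameters sample only the central region, which is the cropping (attention) behaviour the summary describes. Every operation is built from sums and products of `theta` and `U`, which is what lets an autodiff framework propagate gradients back to both the input and the transformation parameters.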

Paper Structure

This paper contains 22 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit that is distorted with random translation, scale, rotation, and clutter. (b) The localisation network of the spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial transformer, after applying the transformation. (d) The classification prediction produced by the subsequent fully-connected network on the output of the spatial transformer. The spatial transformer network (a CNN including a spatial transformer module) is trained end-to-end with only class labels -- no knowledge of the groundtruth transformations is given to the system.
  • Figure 2: The architecture of a spatial transformer module. The input feature map $U$ is passed to a localisation network which regresses the transformation parameters $\theta$. The regular spatial grid $G$ over $V$ is transformed to the sampling grid ${\cal T}_\theta (G)$, which is applied to $U$ as described in Sect. \ref{sec:gridsampler}, producing the warped output feature map $V$. The combination of the localisation network and sampling mechanism defines a spatial transformer.
  • Figure 3: Two examples of applying the parameterised sampling grid to an image $U$ producing the output $V$. (a) The sampling grid is the regular grid $G = {\cal T}_I (G)$, where $I$ is the identity transformation parameters. (b) The sampling grid is the result of warping the regular grid with an affine transformation ${\cal T}_\theta (G)$.
  • Figure 4: A look at the optimisation dynamics for co-localisation. Here we show the localisation predicted by the spatial transformer for three of the 100 dataset images after the SGD step labelled below. By SGD step 180 the model has correctly localised the three digits. A full animation is shown in the video https://goo.gl/qdEhUu
  • Figure 5: The behaviour of a trained 3D MNIST classifier on a test example. The 3D voxel input contains a random MNIST digit which has been extruded and randomly placed inside a $60\times 60 \times 60$ volume. A 3D spatial transformer performs a transformation of the input, producing an output volume whose depth is then flattened. This creates a 2D projection of the 3D space, which the subsequent layers of the network are able to classify. The whole network is trained end-to-end with just classification labels.
  • ...and 1 more figures
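
For reference, the two sampling grids named in the Figure 3 caption can be written out for the affine case as below. This is a sketch that assumes the usual notation, which is not defined in the captions themselves: $(x_i^t, y_i^t)$ are the coordinates of the $i$-th point of the regular grid $G$ over the output $V$, $(x_i^s, y_i^s)$ are the corresponding sampling coordinates in the input $U$, and $\theta_I$ denotes the identity parameters of panel (a).

```latex
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
  = {\cal T}_\theta (G_i)
  = \begin{bmatrix}
      \theta_{11} & \theta_{12} & \theta_{13} \\
      \theta_{21} & \theta_{22} & \theta_{23}
    \end{bmatrix}
    \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix},
\qquad
\theta_I
  = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
```

Setting $\theta = \theta_I$ gives $(x_i^s, y_i^s) = (x_i^t, y_i^t)$, so the sampling grid coincides with $G$ as in panel (a); any other choice of the six parameters yields a translated, scaled, rotated, or sheared grid as in panel (b).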