Scale-invariant Gaussian derivative residual networks

Andrzej Perzanowski; Tony Lindeberg

TL;DR

This paper presents provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing the problem of generalisation across image scales.

Abstract

Generalisation across image scales remains a fundamental challenge for deep networks, which often fail to handle images at scales not seen during training (the out-of-distribution problem). In this paper, we present provably scale-invariant Gaussian derivative residual networks (GaussDerResNets), constructed out of scale-covariant Gaussian derivative residual blocks coupled in cascade, aimed at addressing this problem. By adding residual skip connections to the previous notion of Gaussian derivative layers, deeper networks with substantially increased accuracy can be constructed, while preserving very good scale generalisation properties at the higher level of accuracy. Explicit proofs are provided regarding the underlying scale-covariant and scale-invariant properties in arbitrary dimensions. To analyse the ability of GaussDerResNets to generalise to new scales, we apply them to the new rescaled version of the STL-10 dataset, where training is done at a single fixed scale and evaluation is performed on multiple copies of the test set, each rescaled to a single distinct spatial scale, with scale factors extending over a range of 4. We also conduct similar systematic experiments on the rescaled versions of the Fashion-MNIST and CIFAR-10 datasets. Experimentally, we demonstrate that the GaussDerResNets have strong scale generalisation and scale selection properties on all three rescaled datasets. In our ablation studies, we investigate different architectural variants of GaussDerResNets, demonstrating that basing the architecture on depthwise-separable convolutions reduces both the number of parameters and the amount of computation, with reasonably maintained accuracy and scale generalisation.

Paper Structure (44 sections, 52 equations, 22 figures, 5 tables)

Introduction
Contributions and novelty
Related work
Gaussian derivative residual networks
Scale covariance property for Gaussian derivative layers with residual skip connections
Prerequisites
Transformation property under spatial scaling transformations
Generalisation to more general forms of skip connections in composed Gaussian derivative residual networks
Connections between Gaussian derivative residual blocks and semi-discretisations of the diffusion equation
Resulting Gaussian derivative ResNet architecture
Spatial selection mechanisms
Scale covariance properties of Gaussian derivative residual networks
Multi-scale-channel Gaussian derivative residual networks with scale selection by pooling over scales
Scale invariant property of Gaussian derivative ResNets with scale selection mechanisms for image classification
Additional extensions regarding design options
...and 29 more sections

Figures (22)

Figure 1: Schematic illustrations of simplified residual blocks: (left) the simplest possible residual block of the form (\ref{eq-simplest-resnet-module}), based on a single convolution kernel $w$, (right) the basic residual block of the form (\ref{eq-basic-resnet-module}), based on two convolution kernels $w_1$ and $w_2$. In both diagrams, the skip connections perform identity mappings on the input $f(x)$, denoted as $id$.
Figure 2: Discretised Gaussian derivative kernels up to order 3, for (left) the scale parameter $\sigma = 1$ and for (right) the scale parameter $\sigma = 4$. These receptive fields are obtained by applying central difference operators (corresponding to the given derivative order) to the discrete analogue of the Gaussian kernel, as defined in Equation (\ref{eq:disc-gauss+central-diff}). Both axes of the filters span the interval $[-5, 5]$ in the left figure and the interval $[-16, 16]$ in the right figure. (Top row) the zero-order Gaussian kernel $g(x_1, x_2;\; \sigma)$, (second row) first-order Gaussian derivatives $g_{x_1}(x_1, x_2;\; \sigma)$ and $g_{x_2}(x_1, x_2;\; \sigma)$, (third row) second-order Gaussian derivatives $g_{x_1 x_1}(x_1, x_2;\; \sigma)$, $g_{x_1 x_2}(x_1, x_2;\; \sigma)$ and $g_{x_2 x_2}(x_1, x_2;\; \sigma)$, (bottom row) third-order Gaussian derivatives $g_{x_1 x_1 x_1}(x_1, x_2;\; \sigma)$, $g_{x_1 x_1 x_2}(x_1, x_2;\; \sigma)$, $g_{x_1 x_2 x_2}(x_1, x_2;\; \sigma)$ and $g_{x_2 x_2 x_2}(x_1, x_2;\; \sigma)$.
Figure 3: Illustration of a single-scale-channel Gaussian derivative residual network consisting of 18 Gaussian derivative layers, together with the internal structure of a representative residual block, as well as an inside look into the composition of a Gaussian derivative layer within this residual block. The top part of the diagram illustrates the architecture of a single-scale-channel GaussDerResNet, with the layers organised into residual blocks, with the exception of the first and the last layers, which are regular Gaussian derivative layers. Each residual block is constructed out of two Gaussian derivative layers together with a skip connection, as defined in Equation (\ref{def-gaussder-residual-block-k+1}), and shown in detail for the residual block ${\cal M}_{\sigma_{8}}$ in the bottom right of the diagram, where the light grey dotted lines are not part of the residual block and are shown only for context. Whenever the numbers of input and output feature channels do not match for a residual block, a standard projection is performed, using a $1 \times 1$ convolution (not depicted here) to match the dimensions of the residual signal. The single-scale-channel GaussDerResNet itself is parametrised by an initial scale value $\sigma_{0}$, with the scale parameter of each layer determined according to the geometric distribution defined in Equation (\ref{eq:sigma-geometric}), so that the sizes of the receptive fields increase with the network depth, and with each residual block using the same scale parameter value for both of its layers. Furthermore, each Gaussian derivative layer in the network consists of a convolution with a linear combination of discretised Gaussian derivative basis kernels $w(x;\; \sigma)$, as visualised in the bottom left of the diagram for layer $\kappa = 14$, where the top row shows the contributions from the first-order Gaussian derivatives, and the bottom row the contributions from the second-order Gaussian derivatives. Each basis kernel in the layer is computed by applying a corresponding central difference operator to the discrete analogue of the Gaussian kernel with the corresponding scale parameter $\sigma$, as defined by Equation (\ref{eq:disc-gauss+central-diff}), while $\tilde{C}_{\alpha} = m(\alpha) \, C_{\alpha} \, \sigma^{|\alpha|}$ represents the learned weights of the layer, where the tilde indicates that the weight parameter $C_{\alpha}$ has been multiplied by the scale normalisation factor $\sigma^{|\alpha|}$ and the multinomial normalisation factor $m(\alpha)$, both defined according to Equations (\ref{eq-gauss-der-layer}) and (\ref{eq-gauss-der-layer-in-depthsep}). Finally, the output of the single-scale-channel GaussDerResNet is obtained by applying a spatial selection stage to the output of the final layer. The entire architecture is formally expressed in Equation (\ref{eq:func-comp-gamma-expression-new}).
Figure 4: Commutative diagram for the entire 18-layer Gaussian derivative residual network, illustrating the scale-covariant properties of the architecture. In the diagram, $\mathcal{J}_{\sigma_{i}}$ represents a Gaussian derivative layer at the scale $\sigma_i$, $\mathcal{M}_{\sigma_{i}}$ represents a Gaussian derivative residual block at the scale $\sigma_i$, and both are based on Gaussian derivative primitives up to a given order $\nu$ of spatial differentiation. The scaling operator $\mathcal{S}$ acts on both the image domains and the scale parameters (of the input image $f$ or the layer and residual block outputs $F_{i}$), representing a uniform scaling transformation $(x';\; \sigma')=(S \, x;\; S \, \sigma)$ with a scaling factor $S \in \mathbb{R}_+$. The commutative diagram should be read from the bottom left to the top right, and shows that each level in the cascade of the network can be matched under arbitrary scaling transformations ${\cal S}$ according to $F_{i}^{c_{\text{out}}}(x;\; \sigma_{i}) = {F'}_{i}^{c_{\text{out}}}(S \, x;\; S \, \sigma_{i})$, provided that such matching has also been done in the same manner at every previous step in the hierarchy, including the input image. As described in Section \ref{sec-scale-covariance-properties}, these scale covariance properties are not affected by the use of residual connections, or the presence of batch normalisation or non-linear $\operatorname{ReLU}$ stages in the network.
Figure 5: Conceptual illustration of a multi-scale-channel Gaussian derivative residual network, composed of 4 parallel single-scale-channel GaussDerResNets, referred to as scale channels. Each scale channel is constructed as shown in the top part of Figure \ref{fig-single-gaussderresnet-architecture}. The $i$-th scale channel, denoted $\Gamma_{\sigma_{i,0}}$ and defined in Equation (\ref{eq:func-comp-gamma-expression-new}), consists of (i) the convolutional stage $\Lambda_{\sigma_{i,0}}$, composed of 18 Gaussian derivative layers and represented by the conical frustums (tapered cylinders) in the diagram, which indicate the increase of receptive field size with depth, and (ii) the spatial selection stage. The initial scale values $\sigma_{i,0}$, on which the scale channels are based, are geometrically distributed with a fixed ratio between adjacent values, as expressed in Equation (\ref{eq:sigma0-multichan}). Each scale channel processes a copy of the input image, and crucially, in order to achieve scale covariance, the scale channels share the same weights, with all the scale channels parametrised by the same relative scale ratio $r$. Finally, instead of a fully connected classification layer, permutation-invariant pooling over the scale channels is performed as a final processing stage in the network. The entire architecture is formally expressed in Equation (\ref{eq:complete-multi-net-definition}).
...and 17 more figures
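
The kernel construction described in the caption of Figure 2 can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation. It assumes that the discrete analogue of the Gaussian kernel is $T(n;\; s) = e^{-s} I_n(s)$ with $s = \sigma^2$ and $I_n$ the modified Bessel functions of integer order, that odd-order derivatives combine the operator $(-1/2, 0, 1/2)$ with repeated applications of $(1, -2, 1)$, and that 2-D kernels are formed as separable outer products. The function names and the zero padding at the kernel boundary are choices made for this sketch only.

```python
import math


def discrete_gaussian(sigma, radius):
    """1-D discrete analogue of the Gaussian kernel on n = -radius..radius.

    Assumes T(n; s) = exp(-s) * I_n(s) with s = sigma**2, where I_n is the
    modified Bessel function of integer order, summed from its power series.
    """
    s = sigma * sigma
    half = s / 2.0
    right = []  # kernel values for n = 0..radius
    for n in range(radius + 1):
        term = half ** n / math.factorial(n)  # k = 0 term of the series
        total, k = 0.0, 0
        while term > 1e-17 * total:  # stop once further terms are negligible
            total += term
            k += 1
            term *= half * half / (k * (k + n))
        right.append(math.exp(-s) * total)
    return right[:0:-1] + right  # mirror, since T(-n; s) = T(n; s)


def central_difference(values, order):
    """Apply a central difference operator of the given order (zero padded)."""
    out = list(values)
    for _ in range(order // 2):  # second-order operator (1, -2, 1)
        p = [0.0] + out + [0.0]
        out = [p[i - 1] - 2.0 * p[i] + p[i + 1] for i in range(1, len(p) - 1)]
    if order % 2:  # first-order operator (-1/2, 0, 1/2)
        p = [0.0] + out + [0.0]
        out = [(p[i + 1] - p[i - 1]) / 2.0 for i in range(1, len(p) - 1)]
    return out


def gaussian_derivative_kernel_2d(sigma, radius, order_x1, order_x2):
    """Separable 2-D kernel; rows index x2 and columns index x1."""
    base = discrete_gaussian(sigma, radius)
    k1 = central_difference(base, order_x1)
    k2 = central_difference(base, order_x2)
    return [[v2 * v1 for v1 in k1] for v2 in k2]


# Mixed second-order kernel g_{x1 x2} at sigma = 1 on the grid [-5, 5]^2.
kernel = gaussian_derivative_kernel_2d(1.0, 5, 1, 1)
```

With `radius=5` for $\sigma = 1$ and `radius=16` for $\sigma = 4$, the supports match the intervals quoted in the caption. The zero padding slightly perturbs the outermost samples, which is harmless as long as the kernel tails are already close to zero there.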
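
The scale selection mechanism summarised in the caption of Figure 5 can be illustrated with a toy model. The sketch below assumes geometrically spaced initial scales $\sigma_{i,0} = \sigma_{\min} \, r^{i}$ and uses max pooling as one example of a permutation-invariant pooling operation. The response function is purely hypothetical: it stands in for a weight-sharing, scale-covariant channel by depending only on the ratio between the channel scale and the object size. None of the names below come from the paper.

```python
import math


def channel_scales(sigma_min, ratio, num_channels):
    """Geometrically spaced initial scales, one per scale channel (assumed form)."""
    return [sigma_min * ratio ** i for i in range(num_channels)]


def toy_channel_response(sigma0, object_size):
    """Hypothetical covariant response: depends only on sigma0 / object_size."""
    u = math.log(sigma0 / object_size)
    return math.exp(-u * u)


def pooled_response(object_size, scales):
    """Max pooling over the scale channels (permutation invariant)."""
    return max(toy_channel_response(s, object_size) for s in scales)


def selected_channel(object_size, scales):
    """Index of the scale channel with the strongest response."""
    return max(range(len(scales)),
               key=lambda i: toy_channel_response(scales[i], object_size))


scales = channel_scales(0.5, 2.0, 4)  # 0.5, 1.0, 2.0, 4.0

# Doubling the object size shifts the selected channel by one step,
# while the pooled output stays the same.
small, large = 1.3, 2.6
print(selected_channel(small, scales), selected_channel(large, scales))
print(pooled_response(small, scales), pooled_response(large, scales))
```

In this toy setting the invariance holds only while the best-matching channel stays inside the covered scale range. Once the object size moves beyond the outermost channels, the pooled response starts to change.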
