Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges

Ylva Jansson, Tony Lindeberg

TL;DR

A theoretical analysis of invariance and covariance properties of scale channel networks is presented and a new type of foveated scale channel architecture is proposed, where the scale channels process increasingly larger parts of the image with decreasing resolution.

Abstract

The ability to handle large scale variations is crucial for many real-world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by sharing weights between the scale channels and applying max or average pooling over their outputs. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. We therefore present a theoretical analysis of the invariance and covariance properties of scale channel networks and experimentally evaluate how well different types of scale channel networks generalise to previously unseen scales. We identify limitations of previous approaches and propose a new type of foveated scale channel architecture, where the scale channels process increasingly larger parts of the image with decreasing resolution. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, even when trained on single-scale data, and also give improvements in the small-sample regime.
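
The foveated scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 8-pixel base window, the three channels spaced by a factor of 2, the block-average downsampling and the single linear layer standing in for the shared CNN are all assumptions made for illustration. Each channel takes a larger central window at a proportionally lower resolution, the same weights are applied in every channel, and the channel outputs are combined by max or average pooling.

```python
import numpy as np

def center_crop(image, size):
    """Return the central size x size window of a 2-D array."""
    top = (image.shape[0] - size) // 2
    left = (image.shape[1] - size) // 2
    return image[top:top + size, left:left + size]

def block_average(image, factor):
    """Downsample by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def foveated_channels(image, base_size=8, factors=(1, 2, 4)):
    """One input per scale channel: a central window of side base_size * factor,
    downsampled to base_size x base_size (larger region, lower resolution)."""
    return [block_average(center_crop(image, base_size * f), f) for f in factors]

def shared_network(patch, weights):
    """Hypothetical stand-in for the shared CNN: one linear layer with ReLU."""
    return np.maximum(weights @ patch.ravel(), 0.0)

def fov_scores(image, weights, pooling="max"):
    """Apply the same weights in every channel, then pool over the channels."""
    outputs = np.stack([shared_network(p, weights) for p in foveated_channels(image)])
    return outputs.max(axis=0) if pooling == "max" else outputs.mean(axis=0)

rng = np.random.default_rng(0)
image = rng.random((32, 32))
weights = rng.standard_normal((10, 64))      # 10 classes, 8 * 8 inputs per channel
upscaled = np.kron(image, np.ones((2, 2)))   # same pattern at twice the size

# Compare channel 1 of the enlarged image with channel 0 of the original.
print(np.allclose(foveated_channels(upscaled)[1], foveated_channels(image)[0]))
print(fov_scores(image, weights, "max").shape, fov_scores(image, weights, "avg").shape)
```

In this toy setting, doubling the size of the input moves its representation one channel outwards, which is why pooling over channels with shared weights can give approximately scale-invariant outputs. The innermost channel of the enlarged image and the outermost channel of the original have no counterpart, so with a finite number of channels the pooled outputs need not match exactly near the ends of the covered scale range.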

Paper Structure

This paper contains 35 sections, 47 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Foveated scale channel networks. a) A foveated scale channel network that processes an image of the digit 2. Since each scale channel has a receptive field/support region of fixed size, the channels together process input regions of varying sizes in the original image (circles of corresponding colors). b) This corresponds to a type of foveated processing, where the center of the image is processed at high resolution, which works well for detecting small objects, while larger regions are processed at gradually reduced resolution, which enables detection of larger objects. c) There is a close similarity between this model and the foveal scale space model CVAP166, which was motivated by combining regular scale space axioms with a complementary assumption of a uniform limited processing capacity at all scales.
  • Figure 2: Generalisation ability to unseen scales for a standard CNN and the different scale channel network architectures. The networks are trained on digits of scale 1 (tr1), scale 2 (tr2) or scale 4 (tr4) and evaluated for varying rescalings of the test set. We note that the CNN (a) and the FovConc network (b) have poor generalisation ability to unseen scales, while the FovMax and FovAvg networks (c) generalise extremely well. The SWMax network (d) generalises considerably better than a standard CNN, but there is some drop in performance for scales not seen during training.
  • Figure 3: Varying the sampling density of the scale channels. FovMax and FovAvg networks spanning the scale range $[\frac{1}{4},8]$ are trained with varying spacing between the scale channels ($2$, $2^{1/2}$ and $2^{1/4}$). All networks are trained on scale 2. There is a significant increase in the performance when reducing the spacing between the scale channels from $2$ to $2^{1/2}$ while the effect of a further reduction to $2^{1/4}$ is small.
  • Figure 4: Multiscale image data. All networks are trained on digits in the scale range $[1,4]$ (tr1-4) and evaluated for varying scale factors in the test set. The difference in generalisation ability between training on multiscale and single scale data (dotted lines) is striking for both the CNN and the FovConc network. For the FovMax and FovAvg networks, the difference is negligible between multiscale and single scale training, which illustrates the strong invariance properties of these networks.
  • Figure 5: Training with smaller training sets with large scale variations. All network architectures are evaluated on their ability to classify data with large scale variations as the number of training samples is reduced. Both the training and test sets here span the scale range $[1,4]$. The FovAvg network is the most robust to a reduced number of training samples, followed by the FovMax network.
  • ...and 1 more figure
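
To make the channel spacings in the Figure 3 caption concrete, the sketch below enumerates geometrically spaced channel scales over $[\frac{1}{4},8]$ for ratios of $2$, $2^{1/2}$ and $2^{1/4}$ between neighbouring channels. The helper name `channel_scales` is hypothetical, and the sketch assumes that both endpoints of the range are included as channels, which the caption does not state.

```python
import math

def channel_scales(s_min, s_max, channels_per_octave):
    """Geometrically spaced scales from s_min to s_max (endpoints assumed included).

    channels_per_octave = 1, 2, 4 gives ratios 2, 2**(1/2), 2**(1/4)
    between neighbouring channels.
    """
    n_steps = round(math.log2(s_max / s_min) * channels_per_octave)
    return [s_min * 2 ** (k / channels_per_octave) for k in range(n_steps + 1)]

for per_octave in (1, 2, 4):
    scales = channel_scales(0.25, 8.0, per_octave)
    print(per_octave, len(scales), [round(s, 3) for s in scales[:4]])
```

Under that assumption the range covers five octaves, so halving the spacing in the exponent roughly doubles the number of channels that have to be evaluated.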