How Much Position Information Do Convolutional Neural Networks Encode?

Md Amirul Islam, Sen Jia, Neil D. B. Bruce

TL;DR

The paper investigates whether CNNs encode absolute position information despite local receptive fields and proposes PosENet, a readout that extracts position maps from frozen encoder features using synthetic ground-truth position maps for supervision. Ground-truth maps are generated to be content-independent, and the model is trained with a pixel-wise loss between predicted and ground-truth maps. Across pretrained backbones and data types, the study finds strong evidence that absolute position information is encoded, with zero-padding at borders identified as a major source and deeper features carrying stronger signals. These findings challenge assumptions about spatial invariance in CNNs and have implications for location-sensitive tasks and for understanding how padding shapes feature representations.
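The supervision signal described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `horizontal_gradient_map` and `pixelwise_mse` are hypothetical, the left-to-right linear ramp is just one possible gradient-like pattern, and mean squared error is an assumed choice of pixel-wise loss. The key property is that the target depends only on pixel coordinates, never on image content.

```python
def horizontal_gradient_map(height, width):
    """Return a height x width map rising linearly from 0 (left) to 1 (right).

    Assumption: one simple, content-independent position target.
    """
    if width < 2:
        raise ValueError("width must be at least 2")
    row = [x / (width - 1) for x in range(width)]
    return [list(row) for _ in range(height)]


def pixelwise_mse(pred, target):
    """Mean squared error averaged over all pixels (assumed loss choice)."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# Ground truth is the same for every input image of this size.
gt = horizontal_gradient_map(4, 5)

# A readout that ignores position and predicts a constant map.
constant_guess = [[0.5] * 5 for _ in range(4)]

loss_perfect = pixelwise_mse(gt, gt)
loss_constant = pixelwise_mse(constant_guess, gt)
```

Under these assumptions, a readout can only drive the loss below that of the constant guess if the frozen features it reads from carry some information about where each pixel sits.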

Abstract

In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. Information concerning absolute position is inherently useful, and it is reasonable to assume that deep CNNs may implicitly learn to encode this information if there is a means to do so. In this paper, we test this hypothesis, revealing the surprising degree of absolute position information that is encoded in commonly used neural networks. A comprehensive set of experiments shows the validity of this hypothesis and sheds light on how and where this information is represented, while offering clues as to where positional information is derived from in deep CNNs.
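One way a position-blind local filter can still pick up absolute position is through zero-padding at the image border. The sketch below is a toy illustration, not an experiment from the paper: `conv2d_same` is a hypothetical helper that applies a single 3x3 filter of ones to a constant image with a zero-padded border.

```python
def conv2d_same(image, kernel, pad_value=0.0):
    """Slide a square, odd-sized kernel over the image, keeping the output size.

    Pixels outside the image are read as pad_value (zero-padding by default).
    This is cross-correlation, as in most deep learning libraries; for the
    symmetric kernel used below it is identical to convolution.
    """
    height, width = len(image), len(image[0])
    size = len(kernel)
    radius = size // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(size):
                for dx in range(size):
                    yy, xx = y + dy - radius, x + dx - radius
                    inside = 0 <= yy < height and 0 <= xx < width
                    value = image[yy][xx] if inside else pad_value
                    acc += kernel[dy][dx] * value
            out[y][x] = acc
    return out


# A constant image carries no content cue about location.
image = [[1.0] * 5 for _ in range(5)]
kernel = [[1.0] * 3 for _ in range(3)]

response = conv2d_same(image, kernel)
corner, edge, interior = response[0][0], response[0][2], response[2][2]
```

Here the same filter over identical content returns 4.0 at a corner, 6.0 along an edge, and 9.0 in the interior, so the response alone reveals proximity to the border. Stacking further layers could in principle propagate such a cue inward.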

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Sample predictions of salient regions for input images (left) and slightly cropped versions (right). Cropping shifts features rightward relative to the centre. Notably, this has a significant impact on the output and on which regions are deemed salient, despite there being no explicit position encoding and only a modest change in position in the input.
  • Figure 2: Illustration of PosENet architecture.
  • Figure 3: Sample images and generated gradient-like ground-truth position maps.
  • Figure 4: Qualitative results of PosENet based networks corresponding to different ground-truth patterns.
  • Figure 5: The effect of more layers (top row) and varying kernel size (bottom row) applied in the PosENet. Order (left $\rightarrow$ right): GT (G), PosENet (L=1, KS=1), PosENet (L=2, KS=3), PosENet (L=3, KS=7), VGG (L=1, KS=1), VGG (L=2, KS=3), VGG (L=3, KS=7).
  • ...and 2 more figures