Representation Learning by Learning to Count

Mehdi Noroozi, Hamed Pirsiavash, Paolo Favaro

TL;DR

This work introduces a self-supervised representation learning method that uses counting of visual primitives as an artificial supervision signal. By enforcing an equivariance-like constraint between image downsampling and tiling transformations in feature space, the authors train a neural network with a contrastive loss to avoid trivial solutions. The resulting representations achieve competitive or superior transfer-learning performance on benchmarks such as Pascal VOC and ImageNet/Places, and extensive analyses show the learned features encode high-level visual content and scene structure. The approach highlights counting as a versatile pretext task and suggests extensions to other transformations and semi-supervised settings.

Abstract

We introduce a novel method for representation learning that uses an artificial supervision signal based on counting visual primitives. This supervision signal is obtained from an equivariance relation, which does not require any manual annotation. We relate transformations of images to transformations of the representations. More specifically, we look for the representation that satisfies such a relation rather than the transformations that match a given representation. In this paper, we use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the sum of the numbers of visual primitives over all tiles to the number in the whole image. These two transformations are combined in one constraint and used to train a neural network with a contrastive loss. The proposed task produces representations that perform on par with or exceed the state of the art in transfer learning benchmarks.
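
The combined constraint can be illustrated with a minimal sketch, under stated assumptions. Writing $\phi$ for the feature extractor, $D$ for downsampling and $T_j$ for the $j$-th of four tiles, the sketch penalizes the squared distance between $\phi(D\circ \mathbf{x})$ and $\sum_j \phi(T_j\circ \mathbf{x})$, and adds a hinge term that pushes $\phi(D\circ \mathbf{y})$ of a different image $\mathbf{y}$ at least a margin away. The function names, the margin value, and the toy "count map" images (each pixel holds a number of primitives, so downsampling sums $2\times 2$ blocks) are illustrative assumptions, not the authors' implementation, which trains a neural network on real images.

```python
from typing import Callable, List

Image = List[List[float]]          # toy "count map": pixel = number of primitives
Feature = List[float]


def downsample(img: Image) -> Image:
    """Toy stand-in for D: halve resolution by summing each 2x2 block."""
    h, w = len(img), len(img[0])
    return [
        [img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def tiles(img: Image) -> List[Image]:
    """Toy stand-in for T_1..T_4: split the image into a 2x2 grid of tiles."""
    h2, w2 = len(img) // 2, len(img[0]) // 2
    return [
        [row[c0:c0 + w2] for row in img[r0:r0 + h2]]
        for r0 in (0, h2) for c0 in (0, w2)
    ]


def sq_dist(a: Feature, b: Feature) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def counting_loss(phi: Callable[[Image], Feature],
                  x: Image, y: Image, margin: float = 10.0) -> float:
    """Counting term plus contrastive hinge term (margin is an assumed value)."""
    # Sum the features of the four tiles of x, component by component.
    tile_sum = [sum(vals) for vals in zip(*(phi(t) for t in tiles(x)))]
    # Counting constraint: the downsampled x should "count" the same as its tiles.
    match = sq_dist(phi(downsample(x)), tile_sum)
    # Contrastive term: a different image y should count differently.
    contrast = max(0.0, margin - sq_dist(phi(downsample(y)), tile_sum))
    return match + contrast


def total_count(img: Image) -> Feature:
    """Hypothetical ideal counter for the toy maps: total number of primitives."""
    return [float(sum(sum(row) for row in img))]


x = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]   # 3 primitives
y = [[2, 0, 0, 2], [0, 0, 0, 0], [0, 0, 0, 0], [2, 0, 0, 2]]   # 8 primitives

loss_counter = counting_loss(total_count, x, y)          # 0.0
loss_trivial = counting_loss(lambda img: [0.0], x, y)    # 10.0
```

In this toy setting the ideal counter reaches zero loss, while the all-zero feature satisfies the counting term but is charged the full margin by the hinge term. This is the role the contrastive term plays in ruling out the trivial solution.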

Paper Structure

This paper contains 13 sections, 6 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: The number of visual primitives in the whole image should match the sum of the number of visual primitives in each tile (dashed red boxes).
  • Figure 2: Training AlexNet to learn to count. The proposed architecture uses a siamese arrangement so that we simultaneously produce features for $4$ tiles and a downsampled image. We also compute the feature from a randomly chosen downsampled image ($D\circ \mathbf{y}$) as a contrastive term.
  • Figure 3: Average response of our trained network on the ImageNet validation set. Despite its sparsity ($30$ non-zero entries), the hidden representation in the trained network performs well when transferred to the classification, detection, and segmentation tasks.
  • Figure 4:
  • Figure 5:
  • ...and 6 more figures