Unsupervised Learning by Predicting Noise

Piotr Bojanowski, Armand Joulin

TL;DR

Noise As Targets (NAT) introduces a simple, scalable unsupervised framework for learning visual representations by fixing a set of target prototypes sampled on the unit sphere and training a convnet to align its unit-normalized features to these targets via a 1-to-1 assignment. The method avoids collapse through fixed targets and employs an online, batch-wise assignment update, enabling end-to-end training on massive datasets like ImageNet with standard optimization. Empirically, NAT achieves competitive transfer performance on ImageNet and Pascal VOC compared to leading unsupervised/self-supervised methods, and provides extensive ablations on loss choice, preprocessing, target representations, and training dynamics. The work highlights NAT's simplicity and domain-agnostic nature, while suggesting future exploration of richer target distributions and alignment strategies to further close the gap with supervised performance.

Abstract

Convolutional neural networks provide visual features that perform remarkably well in many computer vision applications. However, training these networks requires significant amounts of supervision. This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and Pascal VOC.
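
The alignment step described above can be illustrated with a minimal sketch. It assumes targets drawn uniformly on the unit sphere, unit-normalized features, and a mean squared distance as the separable square loss. The function names (`sample_unit_sphere`, `l2_normalize`, `square_loss`, `reassign_batch`) are hypothetical, random vectors stand in for convnet outputs, and a brute-force search over permutations stands in for whatever assignment solver a real implementation would use, so it is only practical for tiny batches.

```python
import itertools
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_unit_sphere(n, dim, rng):
    """Draw n fixed targets uniformly on the unit sphere (normalized Gaussians)."""
    return [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n)]


def square_loss(features, targets):
    """Separable square loss: mean squared distance over (feature, target) pairs.

    The 1/n scaling is an assumption made for this sketch.
    """
    total = 0.0
    for fv, tv in zip(features, targets):
        total += sum((f - t) ** 2 for f, t in zip(fv, tv))
    return total / len(features)


def reassign_batch(features, targets):
    """Permute the batch's targets to minimize the loss (1-to-1 assignment).

    Brute force over all permutations: illustrative only, O(n!) in batch size.
    """
    best = min(
        itertools.permutations(range(len(targets))),
        key=lambda perm: square_loss(features, [targets[j] for j in perm]),
    )
    return [targets[j] for j in best]


# Toy batch: 5 images, 3-dimensional features (stand-ins for convnet outputs).
rng = random.Random(0)
targets = sample_unit_sphere(5, 3, rng)
features = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(3)]) for _ in range(5)]

loss_before = square_loss(features, targets)
targets = reassign_batch(features, targets)
loss_after = square_loss(features, targets)
print(f"loss before reassignment: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because the targets are only permuted within the batch and never modified, the set of targets stays fixed. In a full training loop, a gradient step on the network parameters would follow each reassignment.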

Paper Structure

This paper contains 36 sections, 9 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Our approach takes a set of images, computes their deep features with a convolutional network and matches them to a set of predefined targets from a low dimensional space. The parameters of the network are learned by aligning the features to the targets.
  • Figure 2: On the left, we measure the accuracy on ImageNet after training the features with different permutation rates. There is a clear trade-off, with an optimum at permutations performed every $3$ epochs. On the right, we measure the accuracy on ImageNet after training the features with our unsupervised approach as a function of the number of epochs. The performance improves with longer unsupervised training.
  • Figure 3: Images and their $3$ nearest neighbors in ImageNet according to our model using an $\ell_2$ distance. The query images are shown on the top row, and the nearest neighbors are sorted from closest to farthest. Our features seem to capture global distinctive structures.
  • Figure 4: Filters from the first layer of an AlexNet trained on ImageNet with supervision (left) or with NAT (right). The filters are in grayscale, since we use grayscale gradient images as input. This visualization shows the composition of the gradients with the first layer.