Unsupervised Learning by Predicting Noise
Piotr Bojanowski, Armand Joulin
TL;DR
Noise As Targets (NAT) introduces a simple, scalable unsupervised framework for learning visual representations by fixing a set of target prototypes sampled on the unit sphere and training a convnet to align its unit-normalized features to these targets via a 1-to-1 assignment. The method avoids collapse through fixed targets and employs an online, batch-wise assignment update, enabling end-to-end training on massive datasets like ImageNet with standard optimization. Empirically, NAT achieves competitive transfer performance on ImageNet and Pascal VOC compared to leading unsupervised/self-supervised methods, and provides extensive ablations on loss choice, preprocessing, target representations, and training dynamics. The work highlights NAT's simplicity and domain-agnostic nature, while suggesting future exploration of richer target distributions and alignment strategies to further close the gap with supervised performance.
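The alignment objective summarized above can be written compactly. The following is a sketch under assumed notation that does not appear in this summary: $f_\theta(x_i)$ is the unit-normalized feature of image $x_i$, $y_j$ is a fixed target on the unit sphere, $\sigma$ is a permutation giving the 1-to-1 assignment of targets to images, and the $1/(2n)$ scaling is a convention chosen here for illustration.

```latex
\min_{\theta}\; \min_{\sigma \in S_n}\;
\frac{1}{2n} \sum_{i=1}^{n} \bigl\| f_\theta(x_i) - y_{\sigma(i)} \bigr\|_2^2,
\qquad \| f_\theta(x_i) \|_2 = \| y_j \|_2 = 1 .

% For unit vectors the square loss reduces to an inner product:
\bigl\| f_\theta(x_i) - y_{\sigma(i)} \bigr\|_2^2
  = 2 - 2\, f_\theta(x_i)^{\top} y_{\sigma(i)} .
```

Under these assumptions, minimizing the square loss is the same as maximizing the summed inner products between features and their assigned targets. The loss is a sum of independent per-pair terms, so changing the assignment inside one batch leaves the terms of all other images untouched.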
Abstract
Convolutional neural networks provide visual features that perform remarkably well in many computer vision applications. However, training these networks requires significant amounts of supervision. This paper introduces a generic framework to train deep networks, end-to-end, with no supervision. We propose to fix a set of target representations, called Noise As Targets (NAT), and to constrain the deep features to align to them. This domain-agnostic approach avoids the standard unsupervised learning issues of trivial solutions and collapsing of features. Thanks to a stochastic batch reassignment strategy and a separable square loss function, it scales to millions of images. The proposed approach produces representations that perform on par with state-of-the-art unsupervised methods on ImageNet and Pascal VOC.
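The batch reassignment idea in the abstract can be illustrated with a small, standard-library-only Python sketch. It is not the authors' implementation. All function names are hypothetical, the "features" are random stand-ins for convnet outputs, and the optimal permutation is found by brute force, which only works for tiny batches. At realistic batch sizes one would typically swap in a Hungarian-type solver such as `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_targets(n, dim, rng):
    """Draw n fixed targets on the unit sphere by normalizing Gaussian vectors."""
    return [unit([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n)]


def square_loss(f, y):
    """Squared Euclidean distance between one feature and one target."""
    return sum((a - b) ** 2 for a, b in zip(f, y))


def total_loss(features, targets, assignment):
    """Separable loss: a sum of independent per-image terms."""
    return sum(square_loss(f, targets[t]) for f, t in zip(features, assignment))


def reassign_batch(features, targets, assignment, batch):
    """Re-permute only the targets currently held by the images in `batch`.

    Brute force over permutations (an assumption made for illustration);
    the assignment stays 1-to-1 because targets are only swapped within the batch.
    """
    held = [assignment[i] for i in batch]
    best = min(
        itertools.permutations(held),
        key=lambda perm: sum(
            square_loss(features[i], targets[t]) for i, t in zip(batch, perm)
        ),
    )
    for i, t in zip(batch, best):
        assignment[i] = t
    return assignment


rng = random.Random(0)
n, dim = 6, 3
targets = sample_targets(n, dim, rng)  # fixed once, never updated
# Stand-in for unit-normalized convnet features (hypothetical data).
features = [unit([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n)]
assignment = list(range(n))  # initial 1-to-1 assignment: image i holds target i

before = total_loss(features, targets, assignment)
reassign_batch(features, targets, assignment, batch=[0, 2, 5])
after = total_loss(features, targets, assignment)
print(f"loss before: {before:.4f}, after: {after:.4f}")
```

Because the current assignment is always one of the candidate permutations, a reassignment step can never increase the loss for fixed features. In a training loop, a step like this would alternate with a gradient update of the network on the same batch.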
