Unsupervised Visual Representation Learning by Context Prediction

Carl Doersch, Abhinav Gupta, Alexei A. Efros

TL;DR

The paper introduces a self-supervised objective that learns visual representations by predicting the relative position of patch pairs within the same image, using a twin-branch ConvNet with late fusion. This context-prediction approach yields fc6 embeddings that capture semantically meaningful similarities, enabling unsupervised object discovery and improving performance when pre-trained features are transferred to tasks like Pascal VOC detection. The authors demonstrate versatility across object detection, geometry estimation, and visual data mining, while addressing potential shortcuts such as chromatic aberration. Overall, the method shows that vast unlabeled image collections can yield rich, transferable visual representations, reducing reliance on costly annotations.

Abstract

This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
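
The pretext task described in the abstract can be illustrated with a short sampling routine. This is a minimal sketch rather than the authors' pipeline: the function name `sample_patch_pair`, the label ordering in `NEIGHBOR_OFFSETS`, and the default `patch` and `gap` sizes are all assumptions chosen for illustration.

```python
import numpy as np

# Offsets (row, col) of the eight neighbor positions in a 3x3 grid around the
# center patch. The index into this list is the classification label.
# The ordering is an arbitrary choice made for this sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(image, patch=96, gap=48, rng=None):
    """Return (center_patch, neighbor_patch, label) from a single image.

    `patch` and `gap` are illustrative assumptions, not values stated
    in the text above.
    """
    rng = np.random.default_rng() if rng is None else rng
    step = patch + gap                      # distance between patch origins
    h, w = image.shape[:2]
    if h < 2 * step + patch or w < 2 * step + patch:
        raise ValueError("image too small for a 3x3 patch grid")

    # Choose the center patch so that all eight neighbors stay in bounds.
    top = int(rng.integers(step, h - step - patch + 1))
    left = int(rng.integers(step, w - step - patch + 1))

    # Pick one of the eight relative positions; this is the training target.
    label = int(rng.integers(len(NEIGHBOR_OFFSETS)))
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    n_top, n_left = top + d_row * step, left + d_col * step

    center = image[top:top + patch, left:left + patch]
    neighbor = image[n_top:n_top + patch, n_left:n_left + patch]
    return center, neighbor, label
```

Because the label comes from the sampling step itself, no human annotation is needed: any image large enough to hold the 3x3 grid yields training pairs.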

Paper Structure

This paper contains 12 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Our task for learning patch representations involves randomly sampling a patch (blue) and then one of eight possible neighbors (red). Can you guess the spatial configuration for the two pairs of patches? Note that the task is much easier once you have recognized the object!
  • Figure 2: The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled.
  • Figure 3: Our architecture for pair classification. Dotted lines indicate shared weights. 'conv' stands for a convolution layer, 'fc' stands for a fully-connected one, 'pool' is a max-pooling layer, and 'LRN' is a local response normalization layer. Numbers in parentheses are kernel size, number of outputs, and stride (fc layers have only a number of outputs). The LRN parameters follow Krizhevsky et al. (2012). All conv and fc layers are followed by ReLU nonlinearities, except fc9 which feeds into a softmax classifier.
  • Figure 4: Examples of patch clusters obtained by nearest neighbors. The query patch is shown on the far left. Matches are shown for three different features: fc6 features from a random initialization of our architecture, AlexNet fc7 after training on labeled ImageNet, and the fc6 features learned from our method. Queries were chosen from 1000 randomly-sampled patches. The top group shows examples where our algorithm performs well; in the middle group, AlexNet outperforms our approach; in the bottom group, all three features work well.
  • Figure 5: We trained a network to predict the absolute $(x,y)$ coordinates of randomly sampled patches. Far left: input image. Center left: extracted patches. Center right: the location the trained network predicts for each patch shown on the left. Far right: the same result after our color projection scheme. Note that the far right patches are shown after color projection; the operation's effect is almost unnoticeable.
  • ...and 4 more figures
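
The shared-weight, late-fusion design described in the Figure 3 caption can be sketched with a toy model. This is not the authors' network: a single linear layer with ReLU stands in for each branch's convolutional stack, and the names `init_params`, `embed`, and `predict_configuration`, as well as all layer sizes, are assumptions made for illustration.

```python
import numpy as np


def init_params(in_dim, embed_dim=16, n_classes=8, seed=0):
    """Random weights for a toy stand-in; all sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    return {
        # One set of branch weights, used for BOTH patches (weight sharing).
        "W_branch": rng.normal(0.0, 0.01, (in_dim, embed_dim)),
        "b_branch": np.zeros(embed_dim),
        # The fusion layer sees the two embeddings concatenated.
        "W_fuse": rng.normal(0.0, 0.01, (2 * embed_dim, n_classes)),
        "b_fuse": np.zeros(n_classes),
    }


def embed(patch, params):
    """Shared branch: flatten, linear layer, ReLU (stands in for the conv stack)."""
    x = patch.reshape(-1)
    return np.maximum(x @ params["W_branch"] + params["b_branch"], 0.0)


def predict_configuration(patch_a, patch_b, params):
    """Late fusion: embed each patch separately, then classify the pair."""
    fused = np.concatenate([embed(patch_a, params), embed(patch_b, params)])
    logits = fused @ params["W_fuse"] + params["b_fuse"]
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()              # softmax over the eight arrangements
```

Because the two patches only interact after each has been embedded on its own, the per-patch embedding can later be reused in isolation, which is what makes features such as fc6 usable for nearest-neighbor retrieval.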