Unsupervised Learning of Visual Representations using Videos

Xiaolong Wang, Abhinav Gupta

TL;DR

The paper introduces an unsupervised method for learning visual representations from hundreds of thousands of unlabeled videos: patches are tracked through each video, and a Siamese-triplet CNN is trained with a ranking loss so that tracked patches lie closer in feature space than random patches. Hard negative mining and model ensembling improve performance, reaching 52% mAP on VOC 2012 without any ImageNet data, close to the 54.4% of the ImageNet-supervised ensemble. The approach is also competitive on surface normal estimation, which suggests the learned features transfer to structured vision tasks and can reduce reliance on semantic labels.

Abstract

Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of CNN. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision. That is, two patches connected by a track should have similar visual representation in deep feature space since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We also show that our unsupervised network can perform competitively in other tasks such as surface-normal estimation.

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of our approach. (a) Given unlabeled videos, we perform unsupervised tracking on the patches in them. (b) Triplets of patches, each consisting of the query patch in the initial frame of tracking, the tracked patch in the last frame, and a random patch from another video, are fed into our Siamese-triplet network for training. (c) The learning objective: the distance between the query and tracked patch in feature space should be smaller than the distance between the query and random patches.
  • Figure 2: Given a video about buses (the "bus" label is not used), we perform IDT (Improved Dense Trajectories) on it. Red points represent the SURF feature points; green lines represent the trajectories of the points. We reject the frames with small and large camera motions (top pairs). Given a selected frame, we find the bounding box containing most of the moving SURF points. We then perform tracking. The first and last frames of the track provide a pair of patches for training the CNN.
  • Figure 3: Examples of patch pairs we obtain via patch mining in the videos.
  • Figure 4: Siamese-triplet network. Each base network in the Siamese-triplet network shares the same architecture and parameter weights. The architecture is adapted from AlexNet by using only two fully connected layers. Given a triplet of training samples, we obtain their features from the last layer by forward propagation and compute the ranking loss.
  • Figure 5: Top response regions for the pool5 neurons of our unsupervised CNN. Each row shows the top responses of one neuron.
  • ...and 3 more figures
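
The learning objective in Figure 1(c) and the hard negative mining mentioned in the TL;DR can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the use of cosine distance, the margin value of 0.5, and the function names `cosine_distance`, `triplet_ranking_loss`, and `hardest_negatives` are all assumptions made for this example, and the features here are plain arrays rather than CNN outputs.

```python
import numpy as np


def _normalize(x):
    """Scale each row of x (shape (n, d)) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cosine_distance(a, b):
    """Row-wise cosine distance 1 - cos(a_i, b_i) for arrays of shape (n, d)."""
    return 1.0 - np.sum(_normalize(a) * _normalize(b), axis=1)


def triplet_ranking_loss(query, tracked, negative, margin=0.5):
    """Hinge ranking loss per triplet (margin is an assumed value).

    The loss is zero once the query is closer to its tracked patch than
    to the negative patch by at least `margin`.
    """
    d_pos = cosine_distance(query, tracked)
    d_neg = cosine_distance(query, negative)
    return np.maximum(0.0, d_pos - d_neg + margin)


def hardest_negatives(query, tracked, candidates, margin=0.5, k=1):
    """For each query, return indices of the k candidates with the highest loss.

    `candidates` is a pool of features (shape (m, d)) assumed to come
    from other videos, so any of them may serve as a negative.
    """
    d_pos = cosine_distance(query, tracked)                      # (n,)
    d_neg = 1.0 - _normalize(query) @ _normalize(candidates).T   # (n, m)
    loss = np.maximum(0.0, d_pos[:, None] - d_neg + margin)      # (n, m)
    # Sort candidates by descending loss and keep the first k per query.
    return np.argsort(-loss, axis=1)[:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))                # query patch features
    p = q + 0.05 * rng.normal(size=(4, 8))     # tracked patches: near the queries
    pool = rng.normal(size=(16, 8))            # random patches from other videos
    idx = hardest_negatives(q, p, pool, k=2)
    print(triplet_ranking_loss(q, p, pool[idx[:, 0]]))
```

Selecting negatives by highest loss concentrates training on the random patches the network currently confuses with the query, which is the general idea behind hard negative mining; how and when the paper applies it during training is not specified in this summary.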