Building high-level features using large scale unsupervised learning

Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng

TL;DR

This work asks whether high-level, class-specific detectors can be learned from unlabeled data. It shows that a large, nine-layer, locally connected sparse autoencoder trained on 10 million unlabeled 200×200 images can discover detectors for faces, cat faces, and human bodies. The authors scale up unsupervised learning using local receptive fields, L2 pooling, and local contrast normalization, and train the resulting model (1 billion connections) with asynchronous SGD and model parallelism on a 1,000-machine cluster. The learned detectors are robust to translation, scaling, and out-of-plane rotation, and they transfer to supervised tasks: starting from these features, the network reaches 15.8% accuracy on ImageNet with 22K categories, a 70% relative improvement over the previous state of the art. Overall, the work suggests that baby-like unsupervised learning, with no labels at all, can yield semantically meaningful high-level detectors that improve downstream recognition, aided by scalable distributed training frameworks such as DistBelief.
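To make the L2 pooling named above concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `l2_pool`, the non-overlapping windows, and the 4×4 toy feature maps are assumptions chosen for illustration. It shows one reason this kind of pooling helps with translation robustness: an activation that shifts within a pooling window leaves the pooled output unchanged.

```python
import numpy as np

def l2_pool(feature_map, size):
    """L2-pool a 2D feature map over non-overlapping size x size windows.

    Each output value is the square root of the sum of squared
    activations in one window. Non-overlapping windows are a
    simplification chosen for this sketch.
    """
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop any ragged border so the map tiles evenly into windows.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Axes become (window row, row in window, window col, col in window).
    windows = trimmed.reshape(h_out, size, w_out, size)
    return np.sqrt((windows ** 2).sum(axis=(1, 3)))

# A single active unit, then the same unit shifted by one pixel
# inside the same 2x2 pooling window (hypothetical toy inputs).
a = np.zeros((4, 4))
a[0, 0] = 3.0
b = np.zeros((4, 4))
b[1, 1] = 3.0

pooled_a = l2_pool(a, 2)
pooled_b = l2_pool(b, 2)
```

In this toy case `pooled_a` and `pooled_b` are identical, because both inputs place the same activation inside the top-left window. Shifts that cross a window boundary would change the output, so pooling alone gives only local invariance.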

Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.
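The figures quoted in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the helper name `implied_baseline` is hypothetical, and the baseline it computes is inferred from the stated 15.8% accuracy and 70% relative improvement rather than quoted from the text.

```python
def implied_baseline(new_accuracy, relative_improvement):
    """Return the earlier accuracy implied by a relative improvement.

    If new = old * (1 + r), then old = new / (1 + r).
    """
    return new_accuracy / (1.0 + relative_improvement)

# Numbers taken from the abstract.
new_accuracy_pct = 15.8      # accuracy on the ImageNet task, in percent
relative_improvement = 0.70  # "70% relative improvement"
machines = 1_000
cores = 16_000

# Inferred, not quoted: the accuracy the previous best must have had.
baseline_pct = implied_baseline(new_accuracy_pct, relative_improvement)

# Average cores per machine in the training cluster.
cores_per_machine = cores // machines
```

Under these numbers the implied previous accuracy is a little over 9%, and the cluster averages 16 cores per machine. The improvement is relative: the absolute gain is about 6.5 percentage points, not 70.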

Paper Structure

This paper contains 28 sections, 2 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.
  • Figure 2: Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.
  • Figure 3: Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constraint optimization.
  • Figure 4: Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.
  • Figure 5: Translational invariance properties of the best feature. The x-axis is in pixels.
  • ...and 11 more figures