Learning Visual Features from Large Weakly Supervised Data

Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

TL;DR

This paper demonstrates that convolutional networks can learn strong visual representations from massive weakly labeled data (Flickr 100M) without manual annotations. By training end-to-end on image-caption pairs with large vocabularies, the authors show competitive transfer performance against fully supervised baselines and reveal that the learned word embeddings capture semantic structure and multilingual correspondences. Key contributions include scalable training techniques (uniform per-class sampling, SGD over targets) and evidence that weakly supervised features can approach supervised feature quality on diverse vision tasks. The work also offers practical guidance for weakly supervised learning and points to future avenues combining vision with language models for multimodal understanding.
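The uniform per-class sampling mentioned above can be illustrated with a small sketch. It assumes the scheme means "draw a word uniformly at random, then draw an image whose caption contains that word", so that frequent words do not dominate training. The helper names (`build_class_index`, `sample_uniform_per_class`) and the toy captions are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_class_index(captions):
    """Map each word to the indices of the images whose caption contains it."""
    index = defaultdict(list)
    for image_id, words in enumerate(captions):
        for word in set(words):
            index[word].append(image_id)
    return dict(index)


def sample_uniform_per_class(class_index, rng):
    """Draw a word uniformly at random, then an image labeled with that word.

    Assumption: this two-step draw is one plausible reading of
    "uniform per-class sampling", not the authors' exact procedure.
    """
    word = rng.choice(sorted(class_index))
    image_id = rng.choice(class_index[word])
    return word, image_id


# Toy data (hypothetical): "dog" is three times as frequent as "park" or "sunset".
captions = [["dog", "park"], ["dog"], ["dog", "beach"], ["sunset", "beach"]]
class_index = build_class_index(captions)
rng = random.Random(0)
samples = [sample_uniform_per_class(class_index, rng) for _ in range(1000)]
```

Under this scheme each word is selected as the target roughly equally often, regardless of how many captions contain it.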

Abstract

Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity, and learn correspondences between different languages.

Paper Structure

This paper contains 8 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Six randomly picked photos from the Flickr 100M dataset and the corresponding descriptions.
  • Figure 2: Left-hand side: Precision@10 on a held-out test set of weakly supervised AlexNets trained on Flickr datasets of different sizes, using $K\!=\!1,000$ and a single crop (in red). For reference, we also show the precision@10 of logistic regression trained on features from convolutional networks trained on ImageNet with and without jittering (in blue and black, respectively). Right-hand side: Mean average precision on the Pascal VOC 2007 dataset obtained by logistic regressors trained on features extracted from AlexNet trained on Flickr (in red) and ImageNet with and without jittering (in blue and black, respectively). Higher values are better.
  • Figure 3: Six test images with high scores for different words. The scores were computed using an AlexNet trained on the Flickr dataset with a dictionary size of $K\!=\!100,000$.
  • Figure 4: t-SNE map of $20,000$ Flickr test images based on features extracted from the last layer of an AlexNet trained with $K\!=\!1,000$. A full-resolution map is presented in the supplemental material. The inset shows a cluster of sports.
  • Figure 5: Average classification accuracy (averaged over ten runs) on six different datasets of logistic regressors trained on features produced by weakly supervised AlexNets, which were themselves trained on Flickr image-caption datasets of different sizes (in red). For reference, we also show the classification accuracy of classifiers trained on features from convolutional networks trained on ImageNet without jittering (in black) and with jittering (in blue). Dashed lines indicate the standard deviation across runs. Higher values are better.
  • ...and 1 more figure
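
The precision@10 metric reported in Figure 2 can be sketched as follows. This assumes the common definition: for each image, take the k highest-scoring words and measure the fraction that actually appear in its caption, then average over images. The function name and the toy scores are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def precision_at_k(scores, targets, k=10):
    """Average fraction of the k highest-scoring words that occur in the caption.

    scores:  one list of per-word scores for each image
    targets: one set of word indices (the words in the caption) for each image
    Assumption: a standard precision@k definition, not necessarily the
    exact protocol used in the paper.
    """
    total = 0.0
    for image_scores, caption_words in zip(scores, targets):
        # Rank word indices from highest to lowest score.
        ranked = sorted(range(len(image_scores)),
                        key=lambda j: image_scores[j], reverse=True)
        top_k = ranked[:k]
        total += sum(1 for j in top_k if j in caption_words) / k
    return total / len(scores)


# Toy example (hypothetical): 2 images, a 4-word dictionary, k = 2.
scores = [[0.9, 0.1, 0.8, 0.3],
          [0.2, 0.7, 0.1, 0.6]]
targets = [{0}, {1, 3}]
result = precision_at_k(scores, targets, k=2)
```

In the toy example the first image gets one of its top two words right and the second gets both, so the average is 0.75.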