Learning Visual Features from Large Weakly Supervised Data
Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache
TL;DR
This paper demonstrates that convolutional networks can learn strong visual representations from massive weakly labeled data (the Flickr 100M dataset of photos and captions) without manual annotation. By training end-to-end on image-caption pairs with large vocabularies, the authors show transfer performance competitive with fully supervised baselines and find that the learned word embeddings capture semantic structure and multilingual correspondences. Key contributions include scalable training techniques (uniform per-class sampling, stochastic gradient descent over targets) and evidence that weakly supervised features can approach the quality of supervised features on diverse vision tasks. The work also offers practical guidance for weakly supervised learning and points to future work that combines vision with language models for multimodal understanding.
Abstract
Convolutional networks trained on large supervised datasets produce visual features that form the basis for the state of the art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled datasets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and captions, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity and learn correspondences between different languages.
