WebVision Database: Visual Learning and Understanding from Web Data

Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, Luc Van Gool

TL;DR

The authors present WebVision, a large-scale web image database aligned to ILSVRC 2012 concepts, to study learning from noisy web labels and domain differences. Through baseline AlexNet experiments and transfer learning to Caltech-256 and VOC2007, they show that web-derived supervision can achieve competitive or superior performance and can complement human-annotated data. The work also reveals domain bias between WebVision and ILSVRC 2012, highlighting opportunities for large-scale visual domain adaptation and meta-information integration. Overall, WebVision provides a valuable resource for advancing weakly supervised visual learning using web data.

Abstract

In this paper, we present a study on learning visual recognition models from large-scale noisy web data. We build a new database called WebVision, which contains more than 2.4 million web images crawled from the Internet using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information accompanying those web images (e.g., title, description, tags) is also crawled. A validation set and a test set containing human-annotated images are also provided to facilitate algorithmic development. Based on our new database, we make a few interesting observations: 1) the noisy web images are sufficient for training a good deep CNN model for visual recognition; 2) the model learnt from our WebVision database exhibits comparable or even better generalization ability than the one trained on the ILSVRC 2012 dataset when transferred to new datasets and tasks; 3) a domain adaptation issue (a.k.a. dataset bias) is observed, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation. Our new WebVision database and the relevant studies in this work should benefit the advancement of learning state-of-the-art visual models with minimum supervision based on web data.

Paper Structure

This paper contains 12 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Examples of images from Flickr (top), Google (middle), and ImageNet (bottom). Left: "tree frog"; right: "car wheel"
  • Figure 2: Number of images per category of the WebVision dataset.
  • Figure 3: Examples of image meta information from Flickr and Google. The meta information associated with these two images is: (a) title: "Brambling"; description: "Brambling - Fringilla montifringilla Russia, Moscow region, Saltykovka, 10/13/2007"; tags: "Brambling", "Fringilla montifringilla"; (b) title: "High Quality Stock Photos of brambling"; description: "Brambling, male, North Rhine-Westphalia, Germany / (Fringilla montifringilla) /".
  • Figure 4: Number of inlier images among 200 images per category of the WebVision dataset, sorted by number of "3 votes" images in descending order.
  • Figure 5: Classification accuracy (%) on WebVision validation set when using different percentages of images in the WebVision dataset and the ILSVRC 2012 dataset.
  • ...and 1 more figure