What Makes ImageNet Look Unlike LAION

Ali Shirali, Moritz Hardt

TL;DR

This work proposes a rigorous explanation for the discrepancy between ImageNet and LAIONet, a recreation of ImageNet built by searching LAION image captions, in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets. It also formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category.

Abstract

ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, which we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.
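
The intra-class similarity comparison in the abstract can be illustrated with a minimal sketch. It assumes each image is already represented by an embedding vector and measures similarity as the mean pairwise cosine similarity within one class. Both the choice of cosine similarity and the helper names `cosine` and `intra_class_similarity` are assumptions made for illustration; this excerpt does not state which embedding model or similarity measure the authors use.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_class_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of one class's embeddings.

    `embeddings` is a list of vectors (hypothetical image embeddings).
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to form a pair")
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data: a tightly clustered class versus a more spread-out class.
tight_class = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
spread_class = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tight_score = intra_class_similarity(tight_class)
spread_score = intra_class_similarity(spread_class)
```

In this toy setup `tight_score` is larger than `spread_score`, which is the qualitative pattern the abstract reports for ImageNet relative to LAIONet.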

Paper Structure (25 sections, 29 figures)

Figures (29)

  • Figure 1: Accuracy of ImageNet-trained models when evaluated on ImageNet validation set versus LAIONet. Three types of models are distinguished based on whether they are pre-trained on ImageNet-22k and whether they are fine-tuned on ImageNet-1k. Accuracy is defined as the average of the recalls calculated for each class that is present in LAIONet.
  • Figure 2: The suggested underlying mechanism of data generation and selection in LAIONet and ImageNet. Class $Y$, text description $T$, image $X$, selection $S$ or $S'$.
  • Figure 3: Filtering LAION samples based on their textual similarity to the candidate synsets. The dashed line shows the chosen threshold. (a) The overall probability density function (pdf) of the similarities prior to the second step of filtering. (b, c) The number of ImageNet classes covered by the dataset and the size of the dataset, respectively, as the similarity threshold varies.
  • Figure 4: Relative frequencies of different classes in LAIONet sorted in descending order for the 500 most frequent classes. Some class names are shown. The red line shows uniform weight.
  • Figure 5: A LAION-weighted accuracy is calculated according to the relative frequency of the classes in LAIONet and compared to the accuracy with equally weighted classes.
  • ...and 24 more figures
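
The two accuracy definitions in the captions of Figures 1 and 5 can be made concrete with a small sketch: an equally weighted average of per-class recalls, and a variant that weights each class's recall by its relative frequency. The function names and the toy labels below are hypothetical and only illustrate the definitions as stated in the captions; they are not taken from the authors' code.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / total[c] for c in total}


def equal_weight_accuracy(recalls):
    """Average of per-class recalls, each present class counted once."""
    return sum(recalls.values()) / len(recalls)


def frequency_weighted_accuracy(recalls, class_counts):
    """Per-class recalls weighted by relative class frequency.

    `class_counts` maps each class to its count in the reference dataset.
    """
    n = sum(class_counts[c] for c in recalls)
    return sum(class_counts[c] / n * recalls[c] for c in recalls)


# Toy labels (hypothetical): three "cat" samples and one "dog" sample.
y_true = ["cat", "cat", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "dog"]

recalls = per_class_recall(y_true, y_pred)          # cat: 2/3, dog: 1/1
equal_acc = equal_weight_accuracy(recalls)          # (2/3 + 1) / 2
weighted_acc = frequency_weighted_accuracy(recalls, Counter(y_true))
```

When the class counts come from the evaluated set itself, as in this toy example, the frequency-weighted value coincides with ordinary sample-level accuracy (3 of 4 correct here), while the equally weighted value gives rare classes the same influence as frequent ones.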