ImageNot: A contrast with ImageNet preserves model rankings

Olawale Salaudeen, Moritz Hardt

TL;DR

The paper investigates whether architectural progress driven by ImageNet generalizes to a radically different dataset of similar scale. The authors construct ImageNot from LAION-5B with no human labeling and with concepts disjoint from ImageNet's, train ImageNet-era architectures on it from scratch, and show that both the final model rankings and the relative improvements among models match those on ImageNet, despite substantial label noise and data differences. They further show that pretraining and transfer trends on ImageNot align with those on ImageNet when finetuning to CIFAR-10, though transferability is somewhat diminished due to noisier labels. The results provide strong evidence for the external validity of model development progress. They suggest that benchmarks should focus on ranking stability rather than absolute accuracy, and that clean labels may not be strictly necessary for meaningful benchmarking.

Abstract

We introduce ImageNot, a dataset constructed explicitly to be drastically different than ImageNet while matching its scale. ImageNot is designed to test the external validity of deep learning progress on ImageNet. We show that key model architectures developed for ImageNet over the years rank identically to how they rank on ImageNet when trained from scratch and evaluated on ImageNot. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models when trained and evaluated on an entirely different dataset. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.

Paper Structure (26 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Final Model Ranking and Relative Improvement. Model rankings and relative improvements are preserved when models are trained from scratch on each dataset and evaluated on its held-out test set. For a set of test accuracies $X$, the relative progress of model $i$ is $X_i / X_{\text{AlexNet}}$. ImageNet accuracies range from 0.57 (AlexNet) to 0.85 (EfficientNet V2 L), while accuracies on the much noisier ImageNot range from approximately 0.4 (AlexNet) to 0.6 (EfficientNet V2 L).
  • Figure 2: Examples from ImageNet (left) and ImageNot (right). ImageNot classes (e.g., 'cleats,' 'batter') are disjoint and distinct from ImageNet classes (e.g., 'Irish Terrier,' 'Blenheim Spaniel'). ImageNot, which is web-sourced with automated data selection and labeling, also has greater natural variability and label noise.
  • Figure 3: Overview of the ImageNot dataset curation pipeline. Starting from the LAION-5B corpus, the process selects all candidate noun synsets, removes classes overlapping with ImageNet or containing too few samples, filters out NSFW or ambiguous data, and ranks remaining classes by semantic (RoBERTa) and visual (CLIP) similarity. The final stage yields 1,000 ImageNot classes matched in scale but disjoint in concept from ImageNet, enabling evaluation of external validity.
  • Figure 4: ImageNot caption × synset-gloss RoBERTa embedding similarity for each class. We show summary statistics of the similarity distribution across the 1,000 classes.
  • Figure 5: ImageNot caption × synset-gloss RoBERTa embedding similarity distributions. Each distribution shows the similarities for one class.
  • ...and 7 more figures
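
The relative-progress measure defined in the Figure 1 caption, $X_i / X_{\text{AlexNet}}$, can be illustrated with a minimal sketch. The function names `relative_progress` and `ranking` are hypothetical helpers, not taken from the paper's code, and the only accuracies used are the two endpoints quoted in the caption (the ImageNot values are approximate). Accuracies for the intermediate architectures are not listed here, so they are left out rather than guessed.

```python
def relative_progress(accuracies, baseline="AlexNet"):
    """Return X_i / X_baseline for every model in `accuracies`."""
    base = accuracies[baseline]
    return {name: acc / base for name, acc in accuracies.items()}


def ranking(accuracies):
    """Return model names sorted from lowest to highest accuracy."""
    return sorted(accuracies, key=accuracies.get)


# Endpoint accuracies quoted in the Figure 1 caption. The ImageNot values
# are approximate; intermediate models are omitted because their
# accuracies are not given in the caption.
imagenet = {"AlexNet": 0.57, "EfficientNet V2 L": 0.85}
imagenot = {"AlexNet": 0.40, "EfficientNet V2 L": 0.60}

for dataset, accs in [("ImageNet", imagenet), ("ImageNot", imagenot)]:
    rel = relative_progress(accs)
    print(dataset, {model: round(r, 2) for model, r in rel.items()})

# Rankings are preserved if both datasets order the models the same way.
print("same ranking:", ranking(imagenet) == ranking(imagenot))
```

With these endpoints, the relative progress of EfficientNet V2 L over AlexNet is close to 1.5 on both datasets, even though the absolute accuracies differ substantially. Running the same comparison over the full set of architectures would require their per-dataset test accuracies.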