ImageNot: A contrast with ImageNet preserves model rankings
Olawale Salaudeen, Moritz Hardt
TL;DR
The paper investigates whether architectural progress driven by ImageNet generalizes to a radically different dataset of similar scale. The authors construct ImageNot from LAION-5B without human labeling and with concepts disjoint from ImageNet's, then train ImageNet-era architectures on it from scratch. Both the final model rankings and the relative improvements among models match those on ImageNet, despite substantial label noise and data differences. Pretraining and transfer trends also align: when models are finetuned on CIFAR-10, pretraining on ImageNot produces the same trends as pretraining on ImageNet, though transferability is somewhat diminished due to noisier labels. The results provide strong evidence for the external validity of model development progress. They suggest that benchmarks should focus on ranking stability rather than absolute accuracy, and that clean labels may not be strictly necessary for meaningful benchmarking.
Abstract
We introduce ImageNot, a dataset constructed explicitly to be drastically different from ImageNet while matching its scale. ImageNot is designed to test the external validity of deep learning progress on ImageNet. We show that key model architectures developed for ImageNet over the years rank identically to how they rank on ImageNet when trained from scratch and evaluated on ImageNot. Moreover, the relative improvements of each model over earlier models correlate strongly across the two datasets. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models when trained and evaluated on an entirely different dataset. This stands in contrast with absolute accuracy numbers, which typically drop sharply even under small changes to a dataset.
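The two claims in the abstract, identical rankings and correlated relative improvements, can be made concrete with a short sketch. The accuracy values below are illustrative placeholders, not results from the paper. The model names are hypothetical, and the definition of relative improvement used here (fractional gain over the preceding model in the list) is an assumption that may differ from the paper's.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks (1 = highest value). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def relative_improvements(accs):
    """Fractional gain of each model over the previous one in the list.

    Assumed definition for illustration only.
    """
    return [(b - a) / a for a, b in zip(accs, accs[1:])]


# Placeholder accuracies, ordered from oldest to newest model.
# These are NOT numbers from the paper.
models = ["model_a", "model_b", "model_c", "model_d"]  # hypothetical names
acc_dataset_1 = [0.55, 0.70, 0.75, 0.80]  # stands in for the original benchmark
acc_dataset_2 = [0.40, 0.52, 0.58, 0.61]  # stands in for the contrasting dataset

# Claim 1: the ranking of models is the same on both datasets.
same_ranking = ranks(acc_dataset_1) == ranks(acc_dataset_2)
rho = spearman(acc_dataset_1, acc_dataset_2)

# Claim 2: relative improvements correlate across datasets.
r_improve = pearson(
    relative_improvements(acc_dataset_1),
    relative_improvements(acc_dataset_2),
)

print(f"same ranking: {same_ranking}, Spearman rho: {rho:.3f}")
print(f"correlation of relative improvements: {r_improve:.3f}")
```

In this toy setup, absolute accuracies are lower on the second dataset while the ordering of models is unchanged, which is the pattern the abstract describes: rankings can be stable even when absolute numbers shift.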
