Automated Classification of Model Errors on ImageNet

Momchil Peychev, Mark Niklas Müller, Marc Fischer, Martin Vechev

TL;DR

The paper introduces an automated error-classification framework to analyze the remaining ImageNet classification errors of 962 models, addressing the label noise and ambiguity that undermine top-1 metrics. It assigns each error to one of four categories from Vasudevan et al. or to a residual model-failure set, using 161 fine-grained superclasses and an open-world classifier to detect fine-grained out-of-vocabulary (OOV) errors. Key findings show that higher multi-label accuracy correlates with a rapid decline in severe errors, with larger pre-training data accelerating this trend, while organisms exhibit different error dynamics than artifacts, especially for spurious correlations and fine-grained and fine-grained OOV errors. The approach aligns well with human expert judgments, enabling scalable, repeatable evaluation and providing actionable insights for model development and training data design.

Abstract

While the ImageNet dataset has been driving computer vision research over the past decade, significant label noise and ambiguity have made top-1 accuracy an insufficient measure of further progress. To address this, new label sets and evaluation protocols have been proposed for ImageNet, showing that state-of-the-art models already achieve over 95% accuracy and shifting the focus to investigating why the remaining errors persist. Recent work in this direction employed a panel of experts to manually categorize all remaining classification errors for two selected models. However, this process is time-consuming, prone to inconsistencies, and requires trained experts, making it unsuitable for regular model evaluation and thus limiting its utility. To overcome these limitations, we propose the first automated error classification framework, a valuable tool to study how modeling choices affect error distributions. We use our framework to comprehensively evaluate the error distribution of over 900 models. Perhaps surprisingly, we find that across model architectures, scales, and pre-training corpora, top-1 accuracy is a strong predictor for the portion of all error types. In particular, we observe that the portion of severe errors drops significantly with top-1 accuracy, indicating that, while it underreports a model's true performance, it remains a valuable performance metric. We release all our code at https://github.com/eth-sri/automated-error-analysis.

Paper Structure (41 sections, 41 figures, 2 tables)

Figures (41)

  • Figure 1: We first remove errors w.r.t. the original ImageNet labels caused by overlapping class definitions or missing multi-label annotations, yielding multi-label accuracy (MLA). We then, in this order, identify fine-grained misclassifications, fine-grained misclassifications where the true label of the main entity is not included in the ImageNet label set, non-prototypical examples of a given class, and spurious correlations. This leaves us with severe model failures that are unexplained by our categorization.
  • Figure 2: Venn diagram of the tusker, indian elephant, and african elephant classes.
  • Figure 3: Portion (left) and number (right) of top-1 errors caused by class overlap by group -- organisms (green) and artifacts (red). A 95% confidence interval linear fit is shown on the right.
  • Figure 4: Image with label ox, but also showing the classes barn and fence. Example from YunOHHCC21.
  • Figure 5: Portion (left) and number (right) of top-1 errors caused by missing multi-label annotations by group -- organisms (green) and artifacts (red).
  • ...and 36 more figures
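
The ordered pipeline described in the Figure 1 caption can be sketched as a simple cascade: each top-1 error is assigned to the first category that explains it, and whatever remains is counted as a severe model failure. The Python sketch below is illustrative only and is not the authors' implementation. The category identifiers, the `Prediction` record, and the precomputed `flags` set are all assumptions made for this example; in practice, deciding whether a category applies to an error is the difficult part and is taken as given here.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Category order follows the Figure 1 caption; the identifiers are hypothetical.
CATEGORIES = [
    "class_overlap",
    "missing_multi_label",
    "fine_grained",
    "fine_grained_oov",
    "non_prototypical",
    "spurious_correlation",
]

# Errors in these two categories are removed first, which yields MLA.
MLA_CATEGORIES = {"class_overlap", "missing_multi_label"}


@dataclass
class Prediction:
    """One model prediction; `flags` holds the categories that hypothetical
    upstream checks found applicable to this image/prediction pair."""
    image_id: str
    predicted: str
    original_label: str
    flags: Set[str] = field(default_factory=set)


def classify_error(p: Prediction) -> Optional[str]:
    """Return None for a top-1 hit, else the first applicable category."""
    if p.predicted == p.original_label:
        return None
    for category in CATEGORIES:  # earlier categories take precedence
        if category in p.flags:
            return category
    return "model_failure"  # unexplained by any category


def summarize(preds: List[Prediction]) -> Tuple[float, float, Counter]:
    """Compute top-1 accuracy, multi-label accuracy, and error counts."""
    outcomes = [classify_error(p) for p in preds]
    n = len(outcomes)
    top1 = sum(o is None for o in outcomes) / n
    mla = sum(o is None or o in MLA_CATEGORIES for o in outcomes) / n
    counts = Counter(o for o in outcomes if o is not None)
    return top1, mla, counts


if __name__ == "__main__":
    preds = [
        Prediction("a", "ox", "ox"),
        Prediction("b", "tusker", "african elephant", {"class_overlap"}),
        Prediction("c", "barn", "ox", {"spurious_correlation", "fine_grained"}),
        Prediction("d", "fence", "ox"),
    ]
    top1, mla, counts = summarize(preds)
    print(top1, mla, dict(counts))
```

Because the categories are checked in a fixed order, an error that matches several categories is counted only once, under the earliest one. The sketch therefore treats multi-label accuracy as top-1 accuracy plus the share of predictions explained by class overlap or missing multi-label annotations.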