The Impact of the Single-Label Assumption in Image Recognition Benchmarking

Esla Timothy Anzaku, Seyed Amir Mousavi, Arnout Van Messem, Wesley De Neve

TL;DR

This work addresses the disconnect between the common single-label evaluation paradigm and the inherently multi-label nature of many ImageNet images, which can distort conclusions about model robustness when assessed on ImageNetV2. The authors introduce a principled multi-label prediction framework, including variable top-$k$ evaluation, ASMA as an aggregate multi-label metric, and PatchML, a synthetic diagnostic dataset that isolates object recognition from contextual cues. Across 315 ImageNet-pretrained models, they show that top-$1$ accuracy substantially overstates gaps between ImageNetV1 and ImageNetV2, while ReaL and especially ASMA reveal much smaller or negligible degradation; PatchML further confirms substantial latent Multi-Label Prediction Capability (MLPC) even for single-label-trained models. The findings argue for multi-label-aware benchmarking to accurately reflect real-world visual understanding, and they identify training strategies and architectural features that promote robust MLPC, with practical implications for model selection and reliability in vision systems.

Abstract

Deep neural networks (DNNs) are typically evaluated under the assumption that each image has a single correct label. However, many images in benchmarks like ImageNet contain multiple valid labels, creating a mismatch between evaluation protocols and the actual complexity of visual data. This mismatch can penalize DNNs for predicting correct but unannotated labels, which may partly explain reported accuracy drops, such as the widely cited 11 to 14 percent top-1 accuracy decline on ImageNetV2, a replication test set for ImageNet. This raises the question: do such drops reflect genuine generalization failures or artifacts of restrictive evaluation metrics? We rigorously assess the impact of multi-label characteristics on reported accuracy gaps. To evaluate the multi-label prediction capability (MLPC) of single-label-trained models, we introduce a variable top-$k$ evaluation, where $k$ matches the number of valid labels per image. Applied to 315 ImageNet-trained models, our analyses demonstrate that conventional top-1 accuracy disproportionately penalizes valid but secondary predictions. We also propose Aggregate Subgroup Model Accuracy (ASMA) to better capture multi-label performance across model subgroups. Our results reveal wide variability in MLPC, with some models consistently ranking multiple correct labels higher. Under this evaluation, the perceived gap between ImageNet and ImageNetV2 narrows substantially. To further isolate multi-label recognition performance from contextual cues, we introduce PatchML, a synthetic dataset containing systematically combined object patches. PatchML demonstrates that many models trained with single-label supervision nonetheless recognize multiple objects. Altogether, these findings highlight limitations in single-label evaluation and reveal that modern DNNs have stronger multi-label capabilities than standard metrics suggest.
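The variable top-$k$ evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the choice to score each image as the fraction of its valid labels found among the top-$k$ predictions are assumptions made here for clarity.

```python
import numpy as np

def variable_top_k_accuracy(logits, label_sets):
    """Mean per-image score where k equals the number of valid labels.

    Assumption: an image's score is the fraction of its valid labels
    that appear among the model's k highest-scoring classes.
    """
    scores = []
    for row, labels in zip(logits, label_sets):
        k = len(labels)
        if k == 0:
            raise ValueError("each image needs at least one valid label")
        # Indices of the k largest logits for this image.
        top_k = np.argsort(row)[::-1][:k]
        hits = len(set(top_k.tolist()) & set(labels))
        scores.append(hits / k)
    return float(np.mean(scores))

def top1_accuracy(logits, primary_labels):
    """Conventional top-1 accuracy against one annotated label per image."""
    return float(np.mean(np.argmax(logits, axis=1) == np.asarray(primary_labels)))

# Toy example with 4 classes (hypothetical numbers).
logits = np.array([
    [0.1, 2.0, 1.5, 0.0],  # valid labels {1, 2}, annotated label 2
    [3.0, 0.2, 0.1, 2.5],  # valid labels {0, 3}, annotated label 3
    [0.5, 0.4, 2.2, 0.3],  # valid label  {2},    annotated label 2
])
label_sets = [{1, 2}, {0, 3}, {2}]
primary_labels = [2, 3, 2]

top1 = top1_accuracy(logits, primary_labels)
var_k = variable_top_k_accuracy(logits, label_sets)
print(f"top-1: {top1:.3f}, variable top-k: {var_k:.3f}")
```

In this toy case, the first two images are counted as errors under top-1 because the model ranks a valid but unannotated label first, whereas the variable top-$k$ score credits both valid labels.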

Paper Structure

This paper contains 35 sections, 3 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Examples from ImageNetV2 showing the top-5 predictions of a pre-trained DNN model. The percentages indicate softmax scores. Ground-truth labels and correct predictions are in green; incorrect predictions are in red. While model predictions capture image complexities, evaluating models based solely on a single ground-truth label may obscure multi-label characteristics and underestimate effectiveness.
  • Figure 2: An illustration of the PatchML dataset creation process, organized into two main stages: (1) Patch Extraction and Aggregation, where object regions are cropped and pooled; and (2) Multi-label Image Generation, where a predefined number of patches are randomly sampled without replacement and placed on blank backgrounds to create new multi-label images with corresponding ground-truth label sets.
  • Figure 3: Distribution of images by number of ground-truth labels in ImageNetV1 and ImageNetV2, with images having more than five labels grouped as "$>5$".
  • Figure 4:
  • Figure 5:
  • ...and 5 more figures
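
The second stage of the PatchML process in the Figure 2 caption (sampling a predefined number of patches without replacement and placing them on a blank background) might look roughly like the sketch below. The function name `make_patch_image`, the fixed patch size, the side-by-side layout, and the black background are all assumptions for illustration; the caption does not specify how patches are arranged.

```python
import numpy as np

def make_patch_image(pool, num_patches, patch_size=32, rng=None):
    """Compose one synthetic multi-label image from a pool of patches.

    pool: list of (patch, label) pairs, where each patch is a
          (patch_size, patch_size, 3) uint8 array (assumed pre-resized).
    Returns the composed canvas and its ground-truth label set.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample patch indices without replacement, as in the caption.
    chosen = rng.choice(len(pool), size=num_patches, replace=False)
    # Blank (black) background; the horizontal layout is hypothetical.
    canvas = np.zeros((patch_size, patch_size * num_patches, 3), dtype=np.uint8)
    labels = set()
    for slot, idx in enumerate(chosen):
        patch, label = pool[idx]
        canvas[:, slot * patch_size:(slot + 1) * patch_size] = patch
        labels.add(label)
    return canvas, labels

# Hypothetical pool: five flat-colored "patches" with distinct class names.
pool = [
    (np.full((32, 32, 3), 40 * (i + 1), dtype=np.uint8), f"class_{i}")
    for i in range(5)
]
canvas, labels = make_patch_image(pool, num_patches=3, rng=np.random.default_rng(0))
print(canvas.shape, sorted(labels))
```

Because the pool here holds one patch per class, sampling without replacement yields exactly `num_patches` distinct labels; with several patches per class, the label set could be smaller.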