Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

Junyu Chen, Md Yousuf Harun, Christopher Kanan

TL;DR

An automated pipeline converts the ImageNet training set into a multi-label dataset without human annotations. The resulting labels align strongly with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks.

Abstract

The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNet-V2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with the original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.
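The last step of the pipeline, turning per-region classifier outputs into one multi-label annotation per image, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_region_scores`, the rule that each region votes only for its top class, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np

def aggregate_region_scores(region_scores, threshold=0.5):
    """Collapse per-region class probabilities into a multi-hot image label.

    region_scores: array-like of shape (num_regions, num_classes); each row
        holds the class probabilities predicted for one object proposal.
    threshold: hypothetical confidence cutoff (an assumption, not a value
        taken from the paper).
    """
    region_scores = np.asarray(region_scores, dtype=float)
    # Each region votes for its single most confident class.
    top_class = region_scores.argmax(axis=1)
    top_conf = region_scores.max(axis=1)
    multi_hot = np.zeros(region_scores.shape[1], dtype=int)
    # Only regions whose top score clears the cutoff contribute a label.
    multi_hot[top_class[top_conf >= threshold]] = 1
    return multi_hot

# Toy example: three proposals, three classes.
scores = [
    [0.1, 0.8, 0.1],  # confident vote for class 1
    [0.7, 0.2, 0.1],  # confident vote for class 0
    [0.4, 0.3, 0.3],  # top score below the cutoff, so it is dropped
]
print(aggregate_region_scores(scores).tolist())  # [1, 1, 0]
```

Raising `threshold` prunes more low-confidence regions, trading label recall for precision.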

Paper Structure

This paper contains 42 sections, 10 equations, 6 figures, and 13 tables.

Figures (6)

  • Figure 1: Comparison of Existing ImageNet Train-split Relabeling Strategies with Ours. Original ImageNet [imagenet] annotations assume a single label per image. (a) MIIL [ridnik2021imagenet] adds hierarchical labels from ImageNet-21K but lacks object-level distinctions. (b) ImageNet-Segments [gao2022large] (IN-Seg) offers pixel masks for 9k training images with single-object annotations. (c) ReLabel [relabel] assigns soft labels via a $15^2$ patch map, requiring crop coordinates to extract local supervision. (d) In contrast, our method generates explicit multi-label annotations with corresponding spatial masks, offering true multi-object labeling for the entire training set.
  • Figure 2: Overview of our relabeling pipeline. (a) We apply MaskCut [wang2023cut] on DINOv3 [simeoni2025dinov3] ViT features to generate object proposals. ReLabel [relabel] maps are used to filter the proposals most aligned with the original ground-truth label, which then supervise a lightweight labeler. (b) At inference, the labeler predicts class scores for each proposal, enabling spatially grounded multi-label annotations. (c) Compared to a global classifier (e.g., EVA02 [fang2024eva]), ReLabel improves proposal filtering but can still produce high-confidence false positives. (d) Visualization of top-1 predictions per region shows that our labeler disambiguates multiple objects better than ReLabel, avoiding context bias and recognizing distinct object categories.
  • Figure 3: Qualitative examples comparing our multi-label annotations against ImageNet and ReaL [real]. (a) Our method successfully corrects missing or incorrect labels from ReaL by identifying additional objects and providing improved grounding. (b) Representative failure cases, including ambiguity (e.g., notebook vs. laptop) and missed object proposals.
  • Figure 4: Comparison of mask proposals from SAMv2 [ravi2024sam] (two configurations) and MaskCut [wang2023cut]. While SAM can produce fine-grained masks under certain settings, it often over-segments or misses key objects depending on the image, highlighting its sensitivity to hyperparameters. In contrast, MaskCut generates more consistent, instance-level masks across diverse images using a fixed configuration, making it more suitable for large-scale object proposal generation.
  • Figure 5: Additional qualitative comparisons between our multi-label annotations and those from ImageNet and ReaL [real]. Blue panels (a-e): Examples categorized by the degree of label overlap with ReaL. Labels pruned due to low confidence are indicated by strikethrough (deleted). Green panels (f): Common failure modes, including missed part–whole relations, fine-grained confusion, and proposals outside the label space (out-of-vocabulary, OOV).
  • ...and 1 more figures
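
The proposal filtering step described in the Figure 2(a) caption, keeping the proposals that a ReLabel-style patch map associates with the original ground-truth label, could look roughly like the sketch below. The exact criterion used in the paper is not given in this excerpt. The function name `select_aligned_proposals`, the mean pooling over the mask, and the assumption that masks are already resized to the patch-map resolution are all hypothetical choices for illustration.

```python
import numpy as np

def select_aligned_proposals(label_map, masks, gt_class):
    """Return indices of proposals whose pooled patch scores favor gt_class.

    label_map: array of shape (H, W, C) with per-patch class scores
        (a stand-in for a ReLabel-style map).
    masks: boolean array of shape (P, H, W), one mask per proposal, assumed
        to be resized to the patch-map resolution already.
    gt_class: integer index of the image's original single label.
    """
    label_map = np.asarray(label_map, dtype=float)
    masks = np.asarray(masks, dtype=bool)
    kept = []
    for idx, mask in enumerate(masks):
        if not mask.any():
            continue  # an empty proposal has nothing to pool
        # Average the class scores over the patches covered by the mask.
        pooled = label_map[mask].mean(axis=0)
        # Keep the proposal only if its strongest class is the original label.
        if pooled.argmax() == gt_class:
            kept.append(idx)
    return kept

# Toy 2x2 patch map with two classes: left column favors class 0,
# right column favors class 1.
label_map = np.zeros((2, 2, 2))
label_map[:, 0] = [0.9, 0.1]
label_map[:, 1] = [0.2, 0.8]

masks = np.zeros((3, 2, 2), dtype=bool)
masks[0, :, 0] = True  # proposal 0 covers the left column
masks[1, :, 1] = True  # proposal 1 covers the right column
# proposal 2 is left empty on purpose

print(select_aligned_proposals(label_map, masks, gt_class=0))  # [0]
```

In a pipeline of this kind, the kept proposals would serve as training examples for the region-level labeler, each paired with the image's original label.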