SOOD-ImageNet: a Large-Scale Dataset for Semantic Out-Of-Distribution Image Classification and Semantic Segmentation

Alberto Bacchin, Davide Allegro, Stefano Ghidoni, Emanuele Menegatti

TL;DR

SOOD-ImageNet introduces a large-scale, semantically perturbed benchmark for semantic Out-Of-Distribution (SOOD) generalization in both image classification and semantic segmentation. It employs a novel data engine that combines language hierarchies, Vision-Language Models, CLIP-based scoring, and targeted human verification to create IID and two OOD splits (Easy and Hard) from ImageNet-21K-P, yielding approximately 1.6M images across 56 super-classes. Experimental results show that state-of-the-art DL models and large foundation models struggle with semantic shifts, with modest gains from data augmentation or pre-training, underscoring the challenge of SOOD generalization. The dataset and methodology enable scalable evaluation of SOOD across tasks and motivate future work on richer partitions, broader coverage, and potential OSR applications.
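The partitioning idea can be illustrated with a minimal sketch. This is not the authors' data engine: it assumes, purely for illustration, that each sub-class of a super-class already has per-image similarity scores against the super-class name (for example from a CLIP-style model), and that sub-classes are ranked by mean score and cut at hypothetical fractions. The function name, the fractions, the sub-class names, and the scores are all made up.

```python
from statistics import mean

def split_subclasses(scores_by_subclass, iid_frac=0.5, easy_frac=0.25):
    """Rank sub-classes by mean similarity score and cut them into three groups.

    scores_by_subclass: dict mapping a sub-class name to a list of per-image
    similarity scores against the super-class (hypothetical input format).
    iid_frac / easy_frac: illustrative cut points, not values from the paper.
    """
    # Highest mean similarity first: these are treated as closest to the super-class.
    ranked = sorted(
        scores_by_subclass,
        key=lambda name: mean(scores_by_subclass[name]),
        reverse=True,
    )
    n = len(ranked)
    n_iid = max(1, round(n * iid_frac))
    n_easy = round(n * easy_frac)
    return {
        "iid": ranked[:n_iid],
        "ood_easy": ranked[n_iid:n_iid + n_easy],
        "ood_hard": ranked[n_iid + n_easy:],
    }

# Made-up scores for four hypothetical sub-classes of a "car" super-class.
example = {
    "sedan": [0.31, 0.29, 0.33],
    "taxi": [0.28, 0.30],
    "race_car": [0.24, 0.22],
    "golf_cart": [0.15, 0.18],
}
splits = split_subclasses(example)
print(splits)
```

In this toy setup the sub-classes semantically closest to the super-class end up in the IID group, while the most distant ones form the Hard group, which mirrors the increasing semantic shift the benchmark is built around.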

Abstract

Out-of-Distribution (OOD) detection in computer vision is a crucial research area, with related benchmarks playing a vital role in assessing the generalizability of models and their applicability in real-world scenarios. However, existing OOD benchmarks in the literature suffer from two main limitations: (1) they often overlook semantic shift as a potential challenge, and (2) their scale is limited compared to the large datasets used to train modern models. To address these gaps, we introduce SOOD-ImageNet, a novel dataset comprising around 1.6M images across 56 classes, designed for common computer vision tasks such as image classification and semantic segmentation under OOD conditions, with a particular focus on the issue of semantic shift. We ensured the necessary scalability and quality by developing an innovative data engine that leverages the capabilities of modern vision-language models, complemented by accurate human checks. Through extensive training and evaluation of various models on SOOD-ImageNet, we showcase its potential to significantly advance OOD research in computer vision. The project page is available at https://github.com/bach05/SOODImageNet.git.

Paper Structure (12 sections, 7 figures, 2 tables)

This paper contains 12 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Example images taken from the proposed SOOD-ImageNet. Note the increasing semantic shift from the train data to the test "Hard" data.
  • Figure 2: Comparison of different datasets for OOD in computer vision
  • Figure 3: Pipeline of the SOOD-ImageNet dataset creation. Our data engine starts from ImageNet-21K-P and creates a hierarchical structure using the semantics of language. Then VLMs are applied to filter, relabel and score the data, alongside human checks.
  • Figure 4: Example images from SOOD-ImageNet. Four classes (i.e. bag, car, coffee and plane) are shown with the corresponding classification (green) and segmentation labels. The semantic shift between $\mathcal{C}_{iid}$ (train), $\mathcal{C}^{E}_{ood}$ (test "Easy") and $\mathcal{C}^{H}_{ood}$ (test "Hard") is also visible. The original ImageNet-21K-P classes of the samples are reported in red.
  • Figure 5: The graph compares various models in terms of performance (F1 score) and number of parameters when tested on $\mathcal{C}^{E}_{ood}$ (Easy) and $\mathcal{C}^{H}_{ood}$ (Hard) for image classification. The gap between the Easy and Hard tests is highlighted with a dotted line. Pre-trained VLMs are in yellow.
  • ...and 2 more figures