Joint Out-of-Distribution Filtering and Data Discovery Active Learning

Sebastian Schmidt, Leonard Schenk, Leo Schwinn, Stephan Günnemann

TL;DR

This work tackles active learning under open-world data, where the unlabeled pool is contaminated with out-of-distribution (OOD) samples and also contains new categories to discover, a setting the paper calls Open-Set Discovery Active Learning (OSDAL). It introduces Joint Out-of-Distribution Filtering and Data Discovery Active Learning (Joda), a single-model framework that separates in-distribution (InD), near-OOD (discoverable), and far-OOD samples without auxiliary models or training access to the unlabeled pool. The training phase uses a combined loss $\mathcal{L}(b) = \mathcal{L}_{CE}(b_{InD}) + \lambda_{OE}\,\mathcal{L}_{OE}(b_{OOD})$, while OOD filtering relies on an energy score $E(x) = -\log \sum_{c} \exp(f_c(x))$ over the model logits $f_c(x)$ and a threshold $t_{opt}$ chosen by ROC analysis and Youden’s statistic; selection is performed with the SISOMe metric and a class-balancing mechanism. The approach is evaluated on CIFAR-10/100 and TinyImageNet across diverse OOD schemes, where Joda achieves the best accuracy, rapid class discovery, and near-perfect selection precision compared to eight baselines. The results indicate that Joda is robust to varying data splits and applicable to real-world open-world vision tasks. Overall, the paper contributes the OSDAL setting and a lightweight AL method that avoids extra models while delivering strong results in open-world settings.
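The filtering step described above can be illustrated with a minimal sketch, not the authors' implementation. It assumes the standard energy score over logits and picks the threshold that maximises Youden's J (true positive rate minus false positive rate). The function names, the "flag as OOD when the score exceeds the threshold" convention, and the availability of separate InD and OOD score lists are all assumptions made for illustration.

```python
import math


def energy(logits):
    """Energy score E(x) = -log(sum_c exp(logit_c)), computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def youden_threshold(ind_scores, ood_scores):
    """Return (t, J) where t maximises Youden's J = TPR - FPR.

    Assumed convention: a sample is flagged as OOD when its score > t.
    Every observed score is tried as a candidate threshold.
    """
    candidates = sorted(set(ind_scores) | set(ood_scores))
    best_t, best_j = candidates[0], -1.0
    for t in candidates:
        tpr = sum(s > t for s in ood_scores) / len(ood_scores)
        fpr = sum(s > t for s in ind_scores) / len(ind_scores)
        j = tpr - fpr
        if j > best_j:  # strict '>' keeps the first (lowest) maximiser
            best_t, best_j = t, j
    return best_t, best_j


# Toy logits: confident InD predictions have low energy, flat OOD ones high.
ind_logits = [[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]]
ood_logits = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.2]]
ind_scores = [energy(l) for l in ind_logits]
ood_scores = [energy(l) for l in ood_logits]
t_opt, j_opt = youden_threshold(ind_scores, ood_scores)
```

In this toy case the two groups are perfectly separable, so the chosen threshold sits at the highest InD energy and J reaches 1. On real data the ROC curve is usually computed with a library routine rather than by this exhaustive scan.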

Abstract

As the data demand for deep learning models increases, active learning (AL) becomes essential to strategically select samples for labeling, which maximizes data efficiency and reduces training costs. Real-world scenarios necessitate the consideration of incomplete data knowledge within AL. Prior works address handling out-of-distribution (OOD) data, while another research direction has focused on category discovery. However, a joint treatment of AL with both OOD data and category discovery remains unexplored. To address this gap, we propose Joint Out-of-distribution filtering and data Discovery Active learning (Joda), which addresses both challenges simultaneously by filtering out OOD data before selecting candidates for labeling. In contrast to previous methods, we deeply entangle the training procedure with filtering and selection to construct a common feature space that aligns known and novel categories while separating OOD samples. Unlike previous works, Joda is highly efficient and completely omits auxiliary models and training access to the unlabeled pool for filtering or selection. In extensive experiments on 18 configurations and 3 metrics, Joda consistently achieves the highest accuracy with the best balance between class discovery and OOD filtering compared to state-of-the-art competitor approaches.

Paper Structure

This paper contains 16 sections, 4 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Overview of the Open-Set Discovery Active Learning cycle. Starting with the labeled pool (1) for training a model (2) used to select data from an unlabeled pool (3). In contrast to previous works, the unlabeled pool comprises three subsets: known classes (3a), novel discoverable classes (3b), and unwanted OOD data (3c). After selection, the cycle is closed with annotation (4).
  • Figure 2: Joint Out-of-Distribution Filtering and Data Discovery Active Learning, comprising the training phase (I), which combines classification and outlier exposure losses, followed by the filtering (II) and selection (III) phases. For the filtering, a threshold is estimated on $L$ (IIa) to separate OOD samples based on their energy value (IIb). Subsequently, samples are selected based on the SISOMe metric [Schmidt2024] (IIIa) combined with class balancing (IIIb).
  • Figure 3: Comparison for CIFAR-100 with ResNet18 and indicated standard errors. From top to bottom: Mean Accuracy, Class Detection, and Selection Precision. OOD datasets from left to right: Random, MNIST, and Places365.
  • Figure 4: Comparison for TinyImageNet with ResNet18 and indicated standard errors. From top to bottom: Accuracy, Class Detection, and Selection Precision. OOD datasets from left to right: MNIST, ImageNetC-800, and Places365.
  • Figure 5: Ablation study on Joda using ResNet18 on CIFAR-100 with Places365 as the OOD dataset, with indicated standard errors.
  • ...and 8 more figures
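
The training phase in Figure 2 combines a classification loss with an outlier exposure loss. As a hedged sketch of what such a combination can look like, the snippet below uses cross-entropy on InD samples plus a weighted outlier exposure term on OOD samples. The outlier exposure term is assumed here to be the cross-entropy to the uniform distribution, a common choice that may differ from the paper's exact formulation; the function names and the default weight `lambda_oe=0.5` are likewise illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, label):
    """Standard classification loss for one InD sample."""
    return -log_softmax(logits)[label]


def outlier_exposure(logits):
    """Cross-entropy to the uniform distribution (assumed OE form).

    Minimised when the model predicts equal probability for all classes.
    """
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)


def combined_loss(ind_batch, ood_batch, lambda_oe=0.5):
    """L = mean CE over InD samples + lambda_oe * mean OE over OOD samples.

    ind_batch: list of (logits, label) pairs; ood_batch: list of logits.
    """
    ce = sum(cross_entropy(l, y) for l, y in ind_batch) / len(ind_batch)
    oe = 0.0
    if ood_batch:  # a batch may contain no labeled OOD samples
        oe = sum(outlier_exposure(l) for l in ood_batch) / len(ood_batch)
    return ce + lambda_oe * oe


# Uniform logits over 4 classes: both terms equal log(4).
loss = combined_loss([([0.0, 0.0, 0.0, 0.0], 0)], [[0.0, 0.0, 0.0, 0.0]])
```

With this form, pushing OOD predictions toward the uniform distribution flattens their logits, which tends to raise their energy relative to confidently classified InD samples and makes a single energy threshold more effective.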