OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning

Wenjun Miao; Guansong Pang; Trong-Tung Nguyen; Ruohang Fang; Jin Zheng; Xiao Bai

TL;DR

OpenCIL introduces the first large-scale benchmark for evaluating OOD detection within class-incremental learning (CIL), combining four CIL models with fifteen OOD detectors to form 60 baselines across CIFAR100 and ImageNet1K with six OOD datasets. It presents two frameworks for integrating OOD detectors into CIL and introduces BER, a baseline that uses New Task Energy Regularization and Old Task Energy Regularization to reduce biases toward OOD samples and newly added classes, formalized via energy-based objectives. The results reveal that higher CIL accuracy does not guarantee better OOD detection, fine-tuning-based detectors generally outperform post-hoc methods, and catastrophic forgetting affects OOD detection; BER consistently improves OOD metrics across datasets and incremental steps. The work provides practical insights for safe open-world deployment of CIL models and offers an extensible, open-source evaluation platform for continued benchmarking in this space.

Abstract

Class-incremental learning (CIL) aims to learn a model that can not only incrementally accommodate new classes but also retain the knowledge learned for old classes. Out-of-distribution (OOD) detection in CIL must preserve this incremental learning ability while rejecting unknown samples drawn from distributions different from those of the learned classes. This capability is crucial for safely deploying CIL models in open-world settings. However, despite remarkable advances in both CIL and OOD detection, there is no systematic, large-scale benchmark for assessing how well advanced CIL models detect OOD samples. To fill this gap, we design a comprehensive empirical study to establish such a benchmark, named OpenCIL. We propose two principled frameworks for equipping four representative CIL models with 15 diverse OOD detection methods, resulting in 60 baseline models for OOD detection in CIL. The empirical evaluation is performed on two popular CIL datasets with six commonly used OOD datasets. One key observation from this evaluation is that CIL models can be severely biased towards OOD samples and newly added classes when they are exposed to open environments. Motivated by this, we further propose a new baseline for OOD detection in CIL, namely Bi-directional Energy Regularization (BER), which is specifically designed to mitigate these two biases in different CIL models by applying energy regularization to both old and new classes. Our experiments demonstrate its superior performance. All code and datasets are open-source at https://github.com/mala-lab/OpenCIL.
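The abstract refers to energy regularization on old and new classes without giving the formulation here. As background only, the following is a minimal sketch of the standard energy score computed from classifier logits, E(x) = -T * log sum_i exp(f_i(x) / T), which is commonly used as an OOD score. The function names, the `temperature` parameter, and the split of logits into old-class and new-class groups are illustrative assumptions; this is not BER's training objective.

```python
import numpy as np


def energy_score(logits, temperature=1.0):
    """Standard energy score: E(x) = -T * logsumexp(f(x) / T).

    Lower energy usually corresponds to more confident, in-distribution inputs.
    """
    z = np.asarray(logits, dtype=float) / temperature
    # Subtract the per-row maximum for numerical stability.
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse


def energy_by_task(logits, num_old_classes, temperature=1.0):
    """Hypothetical helper: energy over old-class and new-class logits separately.

    Assumes the first `num_old_classes` logits belong to previously learned classes.
    """
    logits = np.asarray(logits, dtype=float)
    old_energy = energy_score(logits[..., :num_old_classes], temperature)
    new_energy = energy_score(logits[..., num_old_classes:], temperature)
    return old_energy, new_energy
```

Comparing the two per-group energies on the same inputs is one simple way to inspect whether a model's logits favor newly added classes, under the assumptions stated above.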

Paper Structure (32 sections, 6 equations, 8 figures, 8 tables, 3 algorithms)

Introduction
Related Work
Baselines and Evaluation Protocol
Problem Statement
Baselines for OOD Detection in CIL
Adapting State-of-the-art OOD Detectors to CIL.
The Proposed Baseline: BER.
Evaluation Protocol
CIL Datasets.
OOD Datasets.
Performance Metrics.
Results and Analysis
Implementation Details
Main Results
CIL Models with Higher CIL Accuracy Do Not Necessarily Have Better OOD Detection Performance.
...and 17 more sections

Figures (8)

Figure 1: Qualitative results of the CIL model iCaRL [rebuffi2017icarl] with CIFAR100 [krizhevsky2009learning]. (a) All four representative OOD detection methods show decreasing AUC as the number of incremental steps grows, compared with the same methods operating on the full training data of all steps. (b) Mean prediction confidence of iCaRL on test samples from all incremental classes. (c) Mean prediction confidence of iCaRL when it classifies samples from six OOD datasets into one of the ID classes. The results for the other CIL models are provided in the Appendix.
Figure 2: Two principled frameworks used in OpenCIL to incorporate OOD detection methods into the CIL models. Both frameworks are applied to pre-trained CIL models that are kept frozen while the OOD detection methods are incorporated, so their CIL performance is not affected.
Figure 3: (a) Average performance of CIL models with the OOD detector REGMIX [pinto2023RegMixup] on six OOD datasets at each incremental step on CIFAR100. The results for the other OOD methods are provided in the Appendix. (b) Average performance of four representative OOD methods on six OOD datasets at each incremental step, where the CIL model iCaRL [rebuffi2017icarl] is used. ACC is the accuracy of iCaRL on CIFAR100 at each step. The results for the other three CIL models are provided in the Appendix.
Figure 4: Qualitative results of the CIL model BiC [wu2019large] with CIFAR100 [krizhevsky2009learning]. (a) All four representative OOD detection methods show decreasing AUC as the number of incremental steps grows, compared with the same methods operating on the full training data of all steps. (b) Mean prediction confidence of BiC on test samples from all incremental classes. (c) Mean prediction confidence of BiC when it classifies samples from six OOD datasets into one of the ID classes.
Figure 5: Qualitative results of the CIL model WA [zhao2020maintaining] with CIFAR100 [krizhevsky2009learning]. (a) All four representative OOD detection methods show decreasing AUC as the number of incremental steps grows, compared with the same methods operating on the full training data of all steps. (b) Mean prediction confidence of WA on test samples from all incremental classes. (c) Mean prediction confidence of WA when it classifies samples from six OOD datasets into one of the ID classes.
...and 3 more figures
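
The captions above report AUC and mean prediction confidence. As a rough illustration of how such quantities can be computed, here is a minimal sketch. It assumes that "prediction confidence" means the maximum softmax probability and that higher scores indicate in-distribution (ID) samples; both are assumptions, and the function names are hypothetical rather than taken from the OpenCIL code base.

```python
import numpy as np


def max_softmax_confidence(logits):
    """Maximum softmax probability per sample (assumed meaning of 'prediction confidence')."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)


def auroc(id_scores, ood_scores):
    """AUROC as the probability that a random ID sample outscores a random OOD sample.

    Ties count as one half. Uses all ID/OOD pairs, so memory grows as len(id) * len(ood).
    """
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(wins + 0.5 * ties)
```

Averaging `max_softmax_confidence` over ID test samples and over OOD samples gives the kind of per-step summary described in panels (b) and (c), under the assumptions stated above.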
