Open-World Continual Learning: Unifying Novelty Detection and Continual Learning

Gyuhak Kim, Changnan Xiao, Tatsuya Konishi, Zixuan Ke, Bing Liu

TL;DR

A theoretical proof that good OOD detection for each task within the set of learned tasks (called closed-world OOD detection) is necessary for successful CIL and that the theory can be generalized or extended to open-world CIL, which is the proposed open-world continual learning.

Abstract

As AI agents are increasingly used in the real open world with unknowns or novelties, they need the ability to (1) (a) recognize objects that they have learned before and (b) detect items that they have never seen or learned, and (2) learn the new items incrementally to become more and more knowledgeable and powerful. (1) is called novelty detection or out-of-distribution (OOD) detection and (2) is called class incremental learning (CIL), which is a setting of continual learning (CL). In existing research, OOD detection and CIL are regarded as two completely different problems. This paper first provides a theoretical proof that good OOD detection for each task within the set of learned tasks (called closed-world OOD detection) is necessary for successful CIL. We show this by decomposing CIL into two sub-problems: within-task prediction (WP) and task-id prediction (TP), and proving that TP is correlated with closed-world OOD detection. The key theoretical result is that regardless of whether WP and OOD detection (or TP) are defined explicitly or implicitly by a CIL algorithm, good WP and good closed-world OOD detection are necessary and sufficient conditions for good CIL, which unifies novelty or OOD detection and continual learning (CIL, in particular). We call this traditional CIL the closed-world CIL as it does not detect future OOD data in the open world. The paper then proves that the theory can be generalized or extended to open-world CIL, the proposed open-world continual learning, which can perform CIL in the open world and detect future or open-world OOD data. Based on the theoretical results, new CIL methods are also designed, which outperform strong baselines in CIL accuracy and in continual OOD detection by a large margin.

Paper Structure

This paper contains 46 sections, 7 theorems, 80 equations, 5 figures, 13 tables, and 2 algorithms.

Key Result

Theorem 1

If $H_{TP}(x) \leq \delta$ and $H_{WP}(x) \leq \epsilon$, we have $H_{CIL} (x) \leq \epsilon + \delta.$
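
The bound can be checked numerically with a minimal sketch. It rests on two assumptions that this summary does not state: each $H$ is a cross-entropy against a one-hot label (so it reduces to the negative log of the probability given to the correct answer), and the CIL probability factorizes as the within-task probability times the task-id probability, following the WP/TP decomposition described in the abstract. The function names and the example probabilities are hypothetical.

```python
import math


def cross_entropy_one_hot(prob_of_truth: float) -> float:
    """Cross-entropy against a one-hot label: -log of the probability of the true outcome."""
    return -math.log(prob_of_truth)


def cil_loss(p_wp: float, p_tp: float) -> float:
    """CIL loss under the ASSUMED factorization P_CIL = P_WP * P_TP."""
    return cross_entropy_one_hot(p_wp * p_tp)


# Hypothetical probabilities for a single test sample x.
p_wp = 0.9  # probability of the correct class, given the correct task
p_tp = 0.8  # probability of the correct task

h_wp = cross_entropy_one_hot(p_wp)
h_tp = cross_entropy_one_hot(p_tp)
h_cil = cil_loss(p_wp, p_tp)

# Under the assumptions above, H_CIL = H_WP + H_TP, so any bounds
# epsilon >= H_WP and delta >= H_TP imply H_CIL <= epsilon + delta.
epsilon, delta = 0.2, 0.3
assert h_wp <= epsilon and h_tp <= delta
assert h_cil <= epsilon + delta

print(f"H_WP={h_wp:.4f}  H_TP={h_tp:.4f}  H_CIL={h_cil:.4f}")
```

Because the log of a product is the sum of the logs, the two losses add exactly in this setting, which is why tightening either WP or TP tightens the CIL bound.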

Figures (5)

  • Figure 1: Overview of the prediction and training framework of HAT+CSI and Sup+CSI. (a) HAT+CSI: The CIL prediction is made by taking the argmax over the concatenated outputs from each task. The training of each task uses CSI. That is, the training batch is augmented to give different views of the samples for contrastive training. The training consists of two steps following CSI. The first step learns the feature extractor by using the hard attention algorithm (Serra et al., 2018), which applies task embeddings to find hard masks at each layer. Then, given the learned feature representations, the second step fine-tunes the classifier. (b) Sup+CSI: The CIL prediction is also made by taking the argmax over the concatenated output values from each task, as in HAT+CSI. The model training for each task is similar to HAT+CSI except that it uses the Edge Popup algorithm of SupSup (Ramanujan et al., 2020) to find a sparse network for each task. The sparse networks are indicated by edges of different colors in the diagram. The second step fine-tunes only the classifier, keeping the feature extractor fixed.
  • Figure 2: (a) We train the feature extractor and the task classifier $k$ at task $k$. The output values of the classifier correspond to $|\mathbf{Y}_k| + 1$ classes, in which the last class is for OOD (i.e., representing previous and unseen future classes). At inference/testing, the probability values of each task model without the OOD class are concatenated and the system chooses the class with the maximum score. (b) Transformer and adapter module. The masked adapter network consists of 2 fully connected layers and task-specific masks. During training, only the masked adapters and norm layers are updated and the other parts in the transformer layers remain unchanged.
  • Figure 3: Average forgetting rate (%). The lower the rate, the better the method is.
  • Figure 4: Average forgetting rate (%). The lower the value, the better the method is on forgetting.
  • Figure 5: AUC of the continually trained models following the final task: (a) We use all 10 classes learned from 5 tasks of CIFAR-10 as IND and consider LSUN, CIFAR-10, and CIFAR-100 as OOD. (b) We use all 200 classes learned from 10 tasks of Tiny-ImageNet as IND and consider LSUN, CIFAR-10, and CIFAR-100 as OOD.
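
The inference rule described in the Figure 2(a) caption (drop each task's OOD output, concatenate the remaining probabilities, take the maximum) can be sketched as follows. This is an illustration only: it assumes a softmax over each task's $|\mathbf{Y}_k| + 1$ logits, and the function names and example logits are hypothetical rather than taken from the paper's code.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one task head's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_cil(task_logits: List[List[float]]) -> Tuple[int, int]:
    """Return (task_id, class_within_task) for the highest-scoring class.

    Each entry of task_logits holds |Y_k| + 1 logits for task k;
    the LAST logit is the OOD class and is excluded from the decision.
    """
    concatenated: List[float] = []
    owners: List[Tuple[int, int]] = []  # maps a global index back to (task, class)
    for task_id, logits in enumerate(task_logits):
        in_dist_probs = softmax(logits)[:-1]  # drop the OOD probability
        for class_id, p in enumerate(in_dist_probs):
            concatenated.append(p)
            owners.append((task_id, class_id))
    best_index = max(range(len(concatenated)), key=concatenated.__getitem__)
    return owners[best_index]


# Hypothetical logits: two tasks, two real classes each, plus one OOD logit.
example_logits = [
    [2.0, 0.5, 3.0],  # task 0 puts most of its mass on OOD
    [0.1, 4.0, 0.2],  # task 1 is confident in its class 1
]
prediction = predict_cil(example_logits)
print(prediction)
```

Under this reading, a task head that sends most of its probability mass to its OOD output contributes only small in-distribution scores to the concatenation, so the OOD class acts as an implicit task-id predictor.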

Theorems & Definitions (19)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Remark 5
  • ...and 9 more