Class Incremental Learning via Likelihood Ratio Based Task Prediction

Haowei Lin; Yijia Shao; Weinan Qian; Ningxin Pan; Yiduo Guo; Bing Liu

TL;DR

This paper argues that using a traditional OOD detector for task-id prediction is sub-optimal because additional information available in CIL can be exploited to design a better and principled method for task-id prediction.

Abstract

Class incremental learning (CIL) is a challenging setting of continual learning, which learns a series of tasks sequentially. Each task consists of a set of unique classes. The key feature of CIL is that no task identifier (or task-id) is provided at test time. Predicting the task-id for each test sample is a challenging problem. An emerging theory-guided approach (called TIL+OOD) is to train a task-specific model for each task in a shared network for all tasks based on a task-incremental learning (TIL) method to deal with catastrophic forgetting. The model for each task is an out-of-distribution (OOD) detector rather than a conventional classifier. The OOD detector can perform both within-task (in-distribution (IND)) class prediction and OOD detection. The OOD detection capability is the key to task-id prediction during inference. However, this paper argues that using a traditional OOD detector for task-id prediction is sub-optimal because additional information (e.g., the replay data and the learned tasks) available in CIL can be exploited to design a better and principled method for task-id prediction. We call the new method TPL (Task-id Prediction based on Likelihood Ratio). TPL markedly outperforms strong CIL baselines and has negligible catastrophic forgetting. The code of TPL is publicly available at https://github.com/linhaowei1/TPL.
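The TIL+OOD inference described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each task-specific model yields a task-id probability for the test sample and a within-task class distribution, and that the two are combined by a product. The function name `cil_predict` and the toy numbers are hypothetical.

```python
def cil_predict(task_probs, within_task_probs):
    """Return (task_id, class_id, joint_score) with the highest joint score.

    Assumption (not stated in the abstract): the joint score of a class is
    the task-id probability times the within-task class probability.
    """
    best = None
    for task_id, (p_task, class_probs) in enumerate(zip(task_probs, within_task_probs)):
        for class_id, p_class in enumerate(class_probs):
            joint = p_task * p_class
            if best is None or joint > best[2]:
                best = (task_id, class_id, joint)
    return best


# Toy example: three tasks with two classes each (made-up numbers).
task_probs = [0.2, 0.7, 0.1]
within_task_probs = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
task_id, class_id, joint = cil_predict(task_probs, within_task_probs)
print(task_id, class_id, round(joint, 2))
```

In this toy case the second task wins on its task-id probability even though the first task's model is more confident about its own top class. This is why the quality of task-id prediction matters so much in CIL.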

Paper Structure (52 sections, 3 theorems, 46 equations, 5 figures, 13 tables, 3 algorithms)

Introduction
Related Work
Overview of the Proposed Method
Estimating Task-id Prediction Probability
Theoretical Analysis
Computing Task-ID Prediction Probability
Estimating $\mathcal{P}_{t}$ and $\mathcal{P}_{t^c}$ and Computing Likelihood Ratio
Combining with a Logit-Based Score
Converting Task-id Prediction Scores to Probabilities
Experiments
Experimental Setup
Results and Comparisons
Ablation Study
Conclusion
Appendix of TPL
...and 37 more sections

Key Result

Theorem 4.1

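A test with rejection region $\mathcal{R}$ defined as follows is a unique uniformly most powerful (UMP) test for the hypothesis test problem of whether a sample $\boldsymbol{x}$ is drawn from $\mathcal{P}_{t}$ or from $\mathcal{P}_{t^c}$:

$$\mathcal{R} := \{\boldsymbol{x} : p_{t}(\boldsymbol{x}) / p_{t^c}(\boldsymbol{x}) < \lambda_0\},$$

where $p_{t}$ and $p_{t^c}$ denote the densities of $\mathcal{P}_{t}$ and $\mathcal{P}_{t^c}$, and $\lambda_0$ is a threshold that can be chosen to obtain a specified significance level.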
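To illustrate how a threshold $\lambda_0$ can be tied to a significance level, here is a small sketch under strong simplifying assumptions. It assumes the rejection region has the likelihood-ratio form, that is, reject "the sample comes from $\mathcal{P}_{t}$" when $p_{t}(x)/p_{t^c}(x) < \lambda_0$. It also replaces both distributions with hypothetical one-dimensional unit-variance Gaussians, so the threshold can be computed in closed form. None of the names or numbers come from the paper.

```python
from statistics import NormalDist

# Hypothetical 1-D stand-ins for P_t and P_{t^c} (not the paper's estimators).
p_t = NormalDist(mu=0.0, sigma=1.0)
p_tc = NormalDist(mu=3.0, sigma=1.0)


def likelihood_ratio(x):
    """Ratio of the two densities, p_t(x) / p_tc(x)."""
    return p_t.pdf(x) / p_tc.pdf(x)


# With equal variances and mean(p_tc) > mean(p_t), the ratio decreases in x,
# so {ratio < lambda_0} is the one-sided region {x > cutoff}.
alpha = 0.05                          # desired significance level
cutoff = p_t.inv_cdf(1.0 - alpha)     # P_t(x > cutoff) = alpha
lambda_0 = likelihood_ratio(cutoff)   # ratio threshold matching that cutoff


def in_rejection_region(x):
    """True when x is rejected as a sample from p_t."""
    return likelihood_ratio(x) < lambda_0


size = 1.0 - p_t.cdf(cutoff)          # probability of rejecting a true P_t sample
print(round(size, 4), in_rejection_region(0.0), in_rejection_region(2.0))
```

In realistic settings the densities are not known in closed form, so both the ratio and the threshold would have to be estimated from data.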

Figures (5)

Figure 1: Illustration of the proposed TPL. We use a pre-trained transformer network (in the grey box); see the Experimental Setup section for the case without a pre-trained network. The pre-trained network is fixed, and only the adapters (Houlsby et al., 2019) inserted into the transformer are trainable to adapt to specific tasks. Note that the adapter (in yellow) used by HAT learns all tasks within the same adapter. The yellow boxes on the left show the progressive changes to the adapter as more tasks are learned.
Figure 2: Ablation studies. Panel (a) illustrates the ACC gain achieved by each of the designed techniques on the five datasets; panel (b) displays the average ACC results obtained from different choices of $E_{t}$ and $E_{t^c}$ in the likelihood ratio computation; panel (c) shows the results for various selections of $E_{\textit{logit}}$ in the final TPL score.
Figure 3: The correlation between OOD (AUC) and CIL (ACC) results. Each point denotes the AUC and ACC of one method from the OOD detection results table on the same dataset.
Figure 4: Visualization of the feature distribution of Task $t$ ($t$=1,2,3,4,5) data and the other 4 tasks. We use the trained task-specific feature extractor $h(\boldsymbol{x};\phi^{(t)})$ to extract features from the training data that belongs to task $t$ (which represents $\mathcal{P}_{t}$) and the training data that belongs to the other 4 tasks (which represents $\mathcal{P}_{t^c}$).
Figure 5: A failure case of TIL+OOD methods that predict the task based on the likelihood of $\mathcal{P}_{t}$ (e.g., MORE and ROW). In the figure, the red star has a higher likelihood under $\mathcal{P}_{t}$ ($t=1$) than the green star. However, the likelihood ratio between $\mathcal{P}_{t}$ and $\mathcal{P}_{t^c}$ is lower for the red star than for the green star. The correct choice is to accept the green star, rather than the red star, as coming from Task 1.
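The failure case in Figure 5 can be reproduced numerically with a toy setup. The sketch below is not taken from the paper: it assumes hypothetical one-dimensional Gaussians for $\mathcal{P}_{t}$ and $\mathcal{P}_{t^c}$ and picks two made-up points to play the roles of the red and green stars.

```python
from statistics import NormalDist

# Hypothetical stand-ins: Task 1 data around 0, all other tasks around 1.5.
p_t = NormalDist(mu=0.0, sigma=1.0)
p_tc = NormalDist(mu=1.5, sigma=1.0)


def likelihood(x):
    """Score that looks only at P_t."""
    return p_t.pdf(x)


def likelihood_ratio(x):
    """Score that also accounts for P_{t^c}."""
    return p_t.pdf(x) / p_tc.pdf(x)


red_star = 0.5     # close to the mode of P_t, but also plausible under P_{t^c}
green_star = -1.2  # further from the mode of P_t, but very unlikely under P_{t^c}

for name, x in [("red", red_star), ("green", green_star)]:
    print(name, round(likelihood(x), 3), round(likelihood_ratio(x), 2))
```

Ranking by likelihood alone prefers the red point, while ranking by the likelihood ratio prefers the green point, matching the ordering described in the caption.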

Theorems & Definitions (5)

Theorem 4.1
Theorem 4.2
Definition 1: statistical hypothesis testing and rejection region
Definition 2: UMP test
Lemma E.1
