Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective

Sungmin Cha; Jihwan Kwak; Dongsub Shim; Hyunwoo Kim; Moontae Lee; Honglak Lee; Taesup Moon

TL;DR

This work experimentally analyzes neural network models trained by CIL algorithms using various evaluation protocols from representation learning, and suggests that representation-level evaluation should be considered as an additional recipe for more diverse evaluation of CIL algorithms.

Abstract

Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data while not forgetting past learned classes. The common evaluation protocol for CIL algorithms is to measure the average test accuracy across all classes learned so far -- however, we argue that solely focusing on maximizing the test accuracy may not necessarily lead to developing a CIL algorithm that also continually learns and updates the representations, which may be transferred to the downstream tasks. To that end, we experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning and propose new analysis methods. Our experiments show that most state-of-the-art algorithms prioritize high stability and do not significantly change the learned representation, and sometimes even learn a representation of lower quality than a naive baseline. However, we observe that these algorithms can still achieve high test accuracy because they enable a model to learn a classifier that closely resembles an estimated linear classifier trained for linear probing. Furthermore, the base model learned in the first task, which involves single-task learning, exhibits varying levels of representation quality across different algorithms, and this variance impacts the final performance of CIL algorithms. Therefore, we suggest that the representation-level evaluation should be considered as an additional recipe for more diverse evaluation for CIL algorithms.

Paper Structure (17 sections, 1 equation, 11 figures, 3 tables)

Introduction
Related Work
Towards Diverse Evaluation of CIL from a Representation Perspective
Problem formulation and preliminaries
Proposed Evaluation method for analysis from a representation perspective
Experimental Setup
Experimental Results with the proposed evaluation
Achieving superior performance in conventional metrics does not always mean learning a superior representation
Most regularization-based CIL algorithms significantly prioritize stability
The superior performance of state-of-the-art algorithms might be attributed to their ability to learn a good output layer.
The quality of the representation learned in the first task can have a significant impact on the final evaluation
Concluding Remarks, Limitation and Future Work
Detailed Experimental Settings
Additional Experimental Results
Experimental analysis for other CIL algorithms
...and 2 more sections

Figures (11)

Figure 1: Experimental results of CIL using the ImageNet-100 dataset for a 10-task scenario. The accuracy of state-of-the-art regularization-based CIL algorithms has been gradually increasing, approaching that of Joint training (left). However, we experimentally confirm that the quality of the representations they learn improves only negligibly, or is even worse than that of naive baselines (right).
Figure 2: This figure illustrates the evaluation methods proposed and used in this paper. (a): the standard evaluation method that measures classification accuracy on test data following training task $t$. (b): linear probing evaluation for a model trained on task $t$. (c): measuring CKA between representations of two models trained on different tasks. (d): a comparison of weights in the output layer.
Figure 3: The experimental results of regularization-based CIL algorithms for a 10-task scenario using the ImageNet-100 dataset. "Joint" refers to the performance of the upper bound case using the entire dataset.
Figure 4: The experimental results of regularization-based CIL algorithms for an 11-task scenario using the ImageNet-100 dataset. "Joint" refers to the performance of the upper bound case using the entire dataset.
Figure 5: ${CKA}(t_{1},t_{2})$ in the 10-task scenario for $t_{1}, t_{2}\in\{1,\dots,10\}$. Each ${CKA}(t_{1},t_{2})$ quantifies the similarity between representations of two models trained on distinct tasks. A deeper red indicates higher similarity than a lighter shade of red.
...and 6 more figures
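
The captions for Figures 2(c) and 5 describe comparing representations of models trained on different tasks with CKA. As a rough illustration, the sketch below computes linear CKA in its standard formulation and builds a task-by-task similarity matrix. It assumes the linear variant of CKA and NumPy feature matrices extracted on a shared probe set; the function names `linear_cka` and `cka_matrix` are hypothetical, and the paper's exact CKA variant and feature-extraction setup are not stated on this page.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Both matrices must be computed on the same n_samples inputs;
    the feature dimensions may differ.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_matrix(features_per_task):
    """Pairwise CKA(t1, t2) for a list of per-task feature matrices.

    features_per_task[t] is assumed (hypothetically) to hold the features
    that the model obtained after task t+1 produces on a fixed probe set.
    """
    n_tasks = len(features_per_task)
    sim = np.zeros((n_tasks, n_tasks))
    for i in range(n_tasks):
        for j in range(n_tasks):
            sim[i, j] = linear_cka(features_per_task[i], features_per_task[j])
    return sim
```

Under these assumptions, a matrix whose off-diagonal entries stay close to 1 would correspond to the high-stability behavior described in the abstract, where the representation changes little from task to task.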
