Online Continual Learning via Logit Adjusted Softmax

Zhehao Huang; Tao Li; Chenhe Yuan; Yingwen Wu; Xiaolin Huang

TL;DR

This paper shows theoretically that inter-class imbalance is entirely attributable to imbalanced class-priors, and that the function learned from intra-class intrinsic distributions is the Bayes-optimal classifier. It then shows that a simple adjustment of model logits during training effectively counters class-prior bias and pursues the corresponding Bayes-optimum.

Abstract

Online continual learning is a challenging problem in which models must learn from a non-stationary data stream while avoiding catastrophic forgetting. Inter-class imbalance during training has been identified as a major cause of forgetting, because it biases model predictions towards recently learned classes. In this paper, we show theoretically that inter-class imbalance is entirely attributable to imbalanced class-priors, and that the function learned from intra-class intrinsic distributions is the Bayes-optimal classifier. Building on this, we show that a simple adjustment of model logits during training effectively counters class-prior bias and pursues the corresponding Bayes-optimum. Our proposed method, Logit Adjusted Softmax, mitigates the impact of inter-class imbalance not only in class-incremental setups but also in realistic general setups, with little additional computational cost. We evaluate our approach on various benchmarks and demonstrate significant performance improvements over prior art. For example, our approach improves on the best baseline by 4.6% on CIFAR10.
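To make the idea of adjusting logits during training concrete, the following is a minimal sketch of a logit-adjusted softmax cross-entropy for a single sample. It assumes the adjustment takes the common form of adding the log of estimated class priors to the logits before the softmax; the function name, the `tau` scaling factor, and the `eps` smoothing constant are illustrative assumptions, not details taken from the paper.

```python
import math

def logit_adjusted_cross_entropy(logits, label, priors, tau=1.0, eps=1e-12):
    """Softmax cross-entropy on logits shifted by tau * log(prior).

    logits: raw model scores, one per class
    label:  index of the true class
    priors: estimated class-prior probabilities (assumed to sum to 1)
    tau:    hypothetical factor scaling the adjustment (assumption)
    eps:    small constant guarding against log(0) (assumption)
    """
    # Shift each logit by the scaled log-prior of its class.
    adjusted = [z + tau * math.log(p + eps) for z, p in zip(logits, priors)]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(adjusted)
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    # Negative log-probability of the true class under the adjusted softmax.
    return log_norm - adjusted[label]

# Same raw logits, different priors: the rarer the true class,
# the larger the training loss it receives.
logits = [2.0, 2.0]
uniform_loss = logit_adjusted_cross_entropy(logits, 0, [0.5, 0.5])
skewed_loss = logit_adjusted_cross_entropy(logits, 0, [0.1, 0.9])
```

In this sketch the adjustment is applied only inside the training loss; at test time the raw logits would be used for prediction, so that the learned scores are not tilted towards classes that happened to be frequent in the recent stream.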

Paper Structure (43 sections, 2 theorems, 18 equations, 5 figures, 11 tables, 3 algorithms)

Introduction
Problem Setup
Statistical View for Time-varying Distribution Learning
Method
Logit Adjustment Technique
Logit Adjusted Softmax Cross-entropy Loss
Estimator for Time-varying Class-priors
Related Work
Experiment
Benchmark setups.
Results on Online Class-IL Scenarios
Results on Online Blurry CL Scenarios
Gains on Enhanced Methods
Ablation Studies
Conclusion
...and 28 more sections

Key Result

Theorem 3.1

For the time-varying distribution $\rho_t$, given that its class-conditionals remain the same over time, i.e., $\forall t,\ \mathbb P(x|y,\rho_t)=\mathbb P(x|y,\rho_0)$, the class-conditional function satisfies the optimal classifier $\Phi^*_t$ that minimizes the class-balanced error,

Figures (5)

Figure 1: Left: diagram of Experience Replay (ER) with our proposed Logit Adjusted Softmax and a batch-wise sliding-window estimator (ER-LAS). LAS helps mitigate the inter-class imbalance problem by adding label frequencies to predicted logits. The model in ER-LAS is still trained via the softmax cross-entropy loss. Right: the number of test samples predicted as belonging to each task by Fine-Tune, ER, and ER-LAS on C-CIFAR100 (10 tasks). The gray dashed line indicates the ground-truth task-wise distribution ($1k$ for each). Predictions are counted by the task to which the predicted class belongs.
Figure 2: The number of classes per task in divided iNaturalist. Each of the 26 tasks contains the categories that share the same initial letter.
Figure 3: An illustration of the occurrence of subclasses within each superclass for every task in S-CIFAR100 (20 tasks). The $y$-axis represents the number of occurrences of subclasses. The $x$-axis represents the 20 superclasses. Note that each subclass is a distinct domain.
Figure 4: Comparison with online CL methods based on contrastive learning on C-CIFAR10 (5 tasks). Memory size $M=1k$. The $x$-axis represents training time, and the $y$-axis represents the final average accuracy $A_T$ (higher is better). We evaluate the accuracy and the time efficiency of SCR, OCM, and our ER-LAS at batch sizes of 8, 16, 32, and 64. Note that training time increases as the batch size decreases.
Figure 5: Prediction results by ER and ER-LAS on C-ImageNet (90 tasks). We calculate the average accuracy of classes within each task to demonstrate the recency bias.
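
The batch-wise sliding-window estimator named in the Figure 1 caption can be illustrated with a short sketch. It assumes the estimator simply pools label counts over the most recent batches and normalizes them into class-prior estimates; the class name, the `window_batches` parameter, and the uniform fallback for an empty window are hypothetical choices, not details confirmed by the source.

```python
from collections import Counter, deque

class SlidingWindowPriorEstimator:
    """Estimates class priors from the labels of the last few batches."""

    def __init__(self, num_classes, window_batches):
        self.num_classes = num_classes
        # deque with maxlen drops the oldest batch automatically.
        self.window = deque(maxlen=window_batches)

    def update(self, batch_labels):
        # Store per-batch label counts rather than the raw labels.
        self.window.append(Counter(batch_labels))

    def priors(self):
        # Pool the counts of every batch still inside the window.
        totals = Counter()
        for counts in self.window:
            totals.update(counts)
        n = sum(totals.values())
        if n == 0:
            # Assumption: fall back to uniform priors before any data arrives.
            return [1.0 / self.num_classes] * self.num_classes
        return [totals[c] / n for c in range(self.num_classes)]

est = SlidingWindowPriorEstimator(num_classes=3, window_batches=2)
est.update([0, 0, 1])
est.update([1, 1, 1])
first = est.priors()    # pooled over both batches
est.update([2, 2, 2])   # evicts the oldest batch [0, 0, 1]
second = est.priors()
```

A class absent from the window receives an estimated prior of exactly zero in this sketch, so any consumer that takes a logarithm of these values would need smoothing or a small additive constant.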

Theorems & Definitions (2)

Theorem 3.1
Theorem A.1
