LCA: Local Classifier Alignment for Continual Learning

Tung Tran, Danilo Vasconcellos Vargas, Khoat Than

TL;DR

We develop a complete continual learning solution that follows the model merging approach and uses LCA, a loss that can enable the classifier to generalize well across all observed tasks while also improving robustness.

Abstract

A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a mismatch between the task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss can enable the classifier not only to generalize well across all observed tasks, but also to improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing state-of-the-art methods by a large margin.
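To make the idea of consolidating task knowledge into a unified backbone concrete, here is a minimal sketch of one generic merging rule: a running average of per-task parameters. This is an illustrative assumption, not the paper's Incremental Merging procedure, which the abstract does not specify. The function `merge_incrementally`, the parameter name `"w"`, and the toy weight values are all hypothetical.

```python
import numpy as np


def merge_incrementally(merged, new_weights, num_tasks_seen):
    """Update a running average of parameters with one more task's weights.

    Assumption: merging is a plain uniform average over tasks. The paper's
    actual merging rule may differ.
    """
    return {
        name: merged[name] + (new_weights[name] - merged[name]) / num_tasks_seen
        for name in merged
    }


# Hypothetical backbones, one per task, each fine-tuned separately.
task_backbones = [
    {"w": np.array([1.0, 2.0])},
    {"w": np.array([3.0, 4.0])},
    {"w": np.array([5.0, 6.0])},
]

# Start from the first task's weights, then fold in each later task.
merged = {name: value.copy() for name, value in task_backbones[0].items()}
for t, weights in enumerate(task_backbones[1:], start=2):
    merged = merge_incrementally(merged, weights, t)

print(merged["w"])
```

Under this rule the merged backbone equals the mean of all task backbones seen so far, so no earlier weights need to be stored. A classifier head trained against any single task's backbone now sees features from different, merged parameters, which is the kind of mismatch the abstract says the LCA loss is meant to address.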

Paper Structure

This paper contains 25 sections, 3 theorems, 18 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Consider a model $h_t$ learned from a dataset ${\bm{D}} = \{{\bm{D}}_1, ..., {\bm{D}}_t\}$, where ${\bm{D}}_i$ contains $n_i$ i.i.d. samples from distribution ${\mathcal{N}}_i$ for each $i \le C_t$, and a bounded loss $\ell$. Denote $P = \frac{1}{C_t} \sum_{i=1}^{C_t} {\mathcal{N}}_i$ as the overall

Figures (13)

  • Figure 1: A comparison between IM and IM+LCA. IM is the result after only the Incremental Merging step, while IM+LCA adds Local Classifier Alignment as the final step.
  • Figure 2: Performance curves of different methods across all tasks and datasets. All methods use ViT-B/16-IN1K as the pre-trained backbone without any additional exemplars.
  • Figure 3: Effect of $\lambda$ on accuracy for two datasets.
  • Figure 4: (a) Complementary evaluation of LCA when applied to MOS and SLCA. (b) Robustness of IM and IM+LCA on corruption and perturbation benchmarks.
  • Figure 5: Accuracy of IM and IM+LCA under different corruption and perturbation types. The relative difference between IM and IM+LCA is highlighted.
  • ...and 8 more figures

Theorems & Definitions (6)

  • Theorem 3.1
  • Corollary 1
  • Remark 1
  • Theorem 3.2
  • Proof of Theorem \ref{thm-LCA-generalization}
  • Proof of Theorem \ref{thm-LCA-generalization-change}