CLOSER: Towards Better Representation Learning for Few-Shot Class-Incremental Learning

Junghun Oh, Sungyong Baik, Kyoung Mu Lee

TL;DR

It is claimed that the closer different classes are, the better for FSCIL, in stark contrast to prior beliefs that the inter-class distance should be maximized.

Abstract

Aiming to incrementally learn new classes with only a few samples while preserving the knowledge of base (old) classes, few-shot class-incremental learning (FSCIL) faces several challenges, such as overfitting and catastrophic forgetting. This challenging problem is often tackled by fixing a feature extractor trained on base classes to reduce the adverse effects of overfitting and forgetting. Under this formulation, our primary focus is representation learning on base classes to tackle the unique challenge of FSCIL: simultaneously achieving the transferability and the discriminability of the learned representation. Building upon the recent efforts for enhancing transferability, such as promoting the spread of features, we find that trying to secure the spread of features within a more confined feature space enables the learned representation to strike a better balance between transferability and discriminability. Thus, in stark contrast to prior beliefs that the inter-class distance should be maximized, we claim that the closer different classes are, the better for FSCIL. The empirical results and analysis from the perspective of information bottleneck theory justify our simple yet seemingly counter-intuitive representation learning method, raising research questions and suggesting alternative research directions. The code is available at https://github.com/JungHunOh/CLOSER_ECCV2024.

Paper Structure (21 sections, 2 theorems, 14 equations, 8 figures, 7 tables)


Key Result

Theorem

The lower bound of $\frac{I(Y;Z)}{I(X;Z)}$ in Eq. \ref{eq:ib_inequality} is a monotonically increasing function of $\lvert \Sigma_{W_i} \rvert$ and a monotonically decreasing function of $\lvert \Sigma_{T} \rvert$.
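The two quantities named in the theorem can be illustrated numerically. The sketch below is not the authors' code: it assumes toy isotropic Gaussian features, and the helper names `log_det_cov` and `make_features` are hypothetical. It only shows that pulling class means closer together shrinks $\lvert \Sigma_{T} \rvert$ (the total covariance determinant) while leaving each within-class determinant $\lvert \Sigma_{W_i} \rvert$ roughly unchanged, which is the direction in which the theorem says the lower bound grows. It does not evaluate the bound itself.

```python
import numpy as np

def log_det_cov(features):
    """Log-determinant of the sample covariance of an (n, d) feature matrix."""
    cov = np.cov(features, rowvar=False)
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return logdet

def make_features(class_means, within_std, n_per_class, rng):
    """Sample toy isotropic Gaussian features around each class mean."""
    dim = class_means.shape[1]
    return [m + within_std * rng.standard_normal((n_per_class, dim))
            for m in class_means]

rng = np.random.default_rng(0)
means = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])  # assumed toy class means

far = make_features(means, 1.0, 500, rng)          # widely separated classes
near = make_features(0.25 * means, 1.0, 500, rng)  # same spread, closer classes

# log|Sigma_T|: covariance over all features pooled together
log_det_total_far = log_det_cov(np.concatenate(far))
log_det_total_near = log_det_cov(np.concatenate(near))

# log|Sigma_{W_i}|: covariance within each class
log_det_within_far = [log_det_cov(f) for f in far]
log_det_within_near = [log_det_cov(f) for f in near]

print("log|Sigma_T|  far:", log_det_total_far, " near:", log_det_total_near)
print("log|Sigma_W|  far:", log_det_within_far)
print("log|Sigma_W| near:", log_det_within_near)
```

Log-determinants are used instead of raw determinants because they stay numerically stable in higher-dimensional feature spaces.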

Figures (8)

  • Figure 1: Visualization of representation trained on MNIST. (a) Baseline [zhang2021cec, hersche2022constrained] exhibits great base-class discriminability (large inter-class distance) but weak transferability to the new classes (huge overlap between new and base classes leading to misclassification). (b) Baseline + representation spreading [kornblith2021why, islam2021broad, chen2022perfectly] benefits the new classes (less collapse to the base classes), while compromising base-class discriminability in the context of FSCIL (dispersed intra-class features leading to less accurate class representation with class prototypes). (c) CLOSER (Ours): Dispersing features in a narrowed feature space enhances both discriminability on the base classes (less deviation between intra-class features and class prototypes) and transferability to the new classes (even less overlap between the base and new classes). For instance, the 4 and 9 classes are not distinguishable in (b) and even less so in (a), but CLOSER can yield a representation that successfully discriminates them.
  • Figure 2: The impact of the spread of representation. Stronger emphasis on the self-supervised contrastive loss (larger $\lambda_\text{ssc}$) and a low temperature (skyblue) enhance the new-class performance $A_N$ (left), but at the expense of base-class performance $A_B$ (center). The reduced base-class performance is mainly attributed to the excessive intra-class variation, which adversely affects the class prototype representation (right). The experiments are conducted on the CIFAR100 dataset.
  • Figure 3: Effect of minimizing inter-class distance. As the weight of $\mathcal{L}_{\text{inter}}$, denoted by $\lambda_{\text{inter}}$, increases, the performance on the new classes increases (skyblue) and the performance loss on the base classes induced by CR is greatly alleviated (purple). The experiments are conducted on the CUB200 dataset.
  • Figure 4: (a) Sanity test for $\mathcal{T}(f_{\boldsymbol{\theta}})$: $\mathcal{T}(f_{\boldsymbol{\theta}})$ has a positive correlation with the performance on the new classes. Each data point is obtained with a different configuration of $\tau$ and $\lambda_{\text{ssc}}$ (without $\mathcal{L}_{\text{inter}}$). (b),(c) Relationship between inter-class distance, $\mathcal{T}(f_{\boldsymbol{\theta}})$, and $A_N$: When integrated with representation spreading, reducing inter-class distance encourages better transferability (red points). However, the tendency is broken when reducing inter-class distance without representation spreading (blue points). Please refer to Section \ref{sec:ib} for theoretical support for these observations. The dots with greater transparency correspond to smaller $\lambda_{\text{inter}}$, ranging from 0 to 1 with intervals of 0.1. We set $\lambda_{\text{ssc}}$ to 0.1 when it is used. The experiments are conducted on the CIFAR100 dataset.
  • Figure 5: Information bottleneck trade-off analysis. We compare representations acquired by three different methods by assessing the information bottleneck (IB) trade-off. 'RS' refers to representation spreading methods. We indicate the final models with black edges. The experiments are conducted on the CIFAR100 dataset.
  • ...and 3 more figures
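
The captions above repeatedly refer to class prototypes, inter-class distance, and the deviation of intra-class features from their prototype. The sketch below shows one common way to compute these quantities on top of a fixed feature extractor. It is an illustration under assumptions, not the authors' implementation: prototypes are taken to be per-class mean features, similarity is taken to be cosine similarity, the features are synthetic, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def class_prototypes(features, labels, num_classes):
    """Mean feature per class, shape (num_classes, d). Assumed prototype definition."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def predict(features, prototypes):
    """Assign each feature to the prototype with the highest cosine similarity."""
    sims = l2_normalize(features) @ l2_normalize(prototypes).T
    return sims.argmax(axis=1)

def mean_inter_class_similarity(prototypes):
    """Average cosine similarity over distinct prototype pairs (higher = closer classes)."""
    p = l2_normalize(prototypes)
    sims = p @ p.T
    off_diagonal = sims[~np.eye(len(p), dtype=bool)]
    return off_diagonal.mean()

def mean_intra_class_deviation(features, labels, prototypes):
    """Average (1 - cosine similarity) between each feature and its own class prototype."""
    f = l2_normalize(features)
    p = l2_normalize(prototypes)[labels]
    return (1.0 - (f * p).sum(axis=1)).mean()

# Synthetic stand-in for features produced by a frozen extractor.
rng = np.random.default_rng(0)
num_classes, dim, n_per_class = 5, 16, 40
centers = rng.standard_normal((num_classes, dim))
labels = np.repeat(np.arange(num_classes), n_per_class)
features = centers[labels] + 0.1 * rng.standard_normal((num_classes * n_per_class, dim))

protos = class_prototypes(features, labels, num_classes)
accuracy = (predict(features, protos) == labels).mean()
inter_sim = mean_inter_class_similarity(protos)
intra_dev = mean_intra_class_deviation(features, labels, protos)

print("prototype accuracy:", accuracy)
print("mean inter-class cosine similarity:", inter_sim)
print("mean intra-class deviation:", intra_dev)
```

Tracking the last two numbers together mirrors the trade-off described in Figures 1 to 3: a representation can bring prototypes closer (higher inter-class similarity) while still classifying well, as long as intra-class deviation from the prototypes stays small.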

Theorems & Definitions (4)

  • Theorem
  • proof
  • lemma S1
  • proof