On Distilling the Displacement Knowledge for Few-Shot Class-Incremental Learning

Pengfei Fang, Yongchun Qin, Hui Xue

TL;DR

This work addresses catastrophic forgetting in FSCIL by shifting from traditional similarity-based relational distillation to Displacement Knowledge Distillation (DKD), which preserves full structural relationships via pairwise displacement vectors in the original feature space. The authors propose the Dual Distillation Network (DDNet), which combines traditional knowledge distillation (IKD) for base classes with DKD for novel classes and uses an instance-aware sample selector to fuse predictions from both branches during inference. Empirical results on CIFAR-100, miniImageNet, and CUB-200 show state-of-the-art performance in terms of Knowledge Retention (KR) and robustness to outliers, with DKD providing notable gains in novel-class discrimination. The methodology generalizes beyond FSCIL to broader class-incremental learning settings, suggesting DKD as a versatile distillation paradigm for maintaining distributional consistency across sessions.

Abstract

Few-shot Class-Incremental Learning (FSCIL) addresses the challenges of evolving data distributions and the difficulty of data acquisition in real-world scenarios. To counteract the catastrophic forgetting typically encountered in FSCIL, knowledge distillation is employed to maintain the knowledge from the learned data distribution. Recognizing the limitations of generating discriminative feature representations in a few-shot context, our approach incorporates structural information between samples into knowledge distillation. This structural information serves as a remedy for the low quality of features. Diverging from traditional structured distillation methods that compute sample similarity, we introduce the Displacement Knowledge Distillation (DKD) method. DKD utilizes displacement rather than similarity between samples, incorporating both distance and angular information to significantly enhance the information density retained through knowledge distillation. Observing performance disparities in feature distribution between base and novel classes, we propose the Dual Distillation Network (DDNet). This network applies traditional knowledge distillation to base classes and DKD to novel classes, challenging the conventional integration of novel classes with base classes. Additionally, we implement an instance-aware sample selector during inference to dynamically adjust dual branch weights, thereby leveraging the complementary strengths of each approach. Extensive testing on three benchmarks demonstrates that DDNet achieves state-of-the-art results. Moreover, through rigorous experimentation and comparison, we establish the robustness and general applicability of our proposed DKD method.
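
The contrast between similarity and displacement can be checked numerically. The following is a minimal sketch, not the authors' code: the 2-D feature values are made up for illustration, and cosine similarity and Euclidean distance stand in for the scalar relations a similarity-based method would use. It reflects a small structure through the origin and compares what each kind of relation sees.

```python
import numpy as np


def pairwise_cosine(x):
    """Cosine similarity between every pair of rows (one scalar per pair)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def pairwise_displacement(x):
    """Displacement x_i - x_j for every pair of rows (one D-dim vector per pair)."""
    return x[:, None, :] - x[None, :, :]


# Hypothetical features: structure_b is structure_a reflected through the origin.
structure_a = np.array([[1.0, 0.5], [2.0, 1.5], [0.5, 2.0]])
structure_b = -structure_a

disp_a = pairwise_displacement(structure_a)
disp_b = pairwise_displacement(structure_b)

# Scalar relations cannot tell the two structures apart.
same_similarity = np.allclose(pairwise_cosine(structure_a), pairwise_cosine(structure_b))
same_distance = np.allclose(np.linalg.norm(disp_a, axis=-1), np.linalg.norm(disp_b, axis=-1))

# Displacement vectors keep direction as well as length, so they differ.
same_displacement = np.allclose(disp_a, disp_b)

print(same_similarity, same_distance, same_displacement)
```

Under these assumptions the scalar relations coincide for the two structures while the displacement vectors do not, which is the sense in which displacement carries both distance and angular information.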

Paper Structure

This paper contains 20 sections, 26 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: (a) and (b) show two different structures centered on ${\boldsymbol{x}}_1$ that are symmetric about the origin. (c) and (d) show how RKD and DKD, respectively, measure these two structures. The RKD measure is 1-dimensional, so the measures obtained for the two structures overlap completely. DKD measures structural information in the original dimension, so it can effectively distinguish between the two structures.
  • Figure 2: (a) Performance differences between base and novel classes on different methods. (b) The t-SNE visualization of features from base and novel classes respectively. According to the settings of FSCIL, the base classes have abundant training data and novel classes follow the few-shot setting, thus causing a performance gap between base and novel classes.
  • Figure 3: The framework of the DDNet and the illustration of DKD. We employ IKD to preserve the base knowledge and the proposed DKD method to protect the novel knowledge through the structural relationship. Logits from different sessions are integrated into a final prediction through the sample selector.
  • Figure 4: The illustration of differences between (a) IKD, (b) RKD, and (c) DKD. IKD directly computes the KL-divergence between the teacher's and the student's outputs and thus cannot model the relationship between samples. RKD measures the relation of a sample pair via similarity, and the loss is the sum of the KL-divergences of every row of the similarity matrix. DKD preserves all the structural information by taking differences between samples, which multiplies the number of "teacher-student" pairs by $N-1$. Evidently, RKD leads to coupling between samples, whereas DKD completely avoids such coupling.
  • Figure 5: The illustration of the gradient of DKD. The red part represents the pre-sequence relation, and the blue is the post-sequence. Our proposed DKD includes bidirectional structural information of "teacher-student" pairs.
  • ...and 4 more figures
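
The three distillation schemes contrasted in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's implementation. The softmax normalization, the use of cosine similarity for RKD, and applying a KL-divergence directly to displacement vectors are all assumptions, and every function name here is hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_sum(p, q, eps=1e-12):
    """Sum over rows of KL(p_row || q_row)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def cosine_matrix(x):
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def displacement_pairs(x):
    """All ordered displacements x_i - x_j with i != j, shape (N * (N - 1), D)."""
    n = x.shape[0]
    off_diagonal = ~np.eye(n, dtype=bool)
    return (x[:, None, :] - x[None, :, :])[off_diagonal]


def ikd_loss(teacher, student):
    """Per-sample KL between outputs: N teacher-student pairs, no relations."""
    return kl_sum(softmax(teacher), softmax(student))


def rkd_loss(teacher, student):
    """KL between rows of the similarity matrices (assumed cosine similarity)."""
    return kl_sum(softmax(cosine_matrix(teacher)), softmax(cosine_matrix(student)))


def dkd_loss(teacher, student):
    """KL between displacement vectors: N * (N - 1) teacher-student pairs."""
    return kl_sum(softmax(displacement_pairs(teacher)),
                  softmax(displacement_pairs(student)))


# Hypothetical outputs for N = 4 samples with D = 5 dimensions.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 5))
reflected_student = -teacher  # same pairwise cosine similarities, flipped displacements

print("IKD:", ikd_loss(teacher, reflected_student))
print("RKD:", rkd_loss(teacher, reflected_student))
print("DKD:", dkd_loss(teacher, reflected_student))
```

In this sketch a student whose outputs are the teacher's reflected through the origin incurs no RKD penalty, because the similarity matrices are identical, but a positive DKD penalty. With $N$ samples, `displacement_pairs` returns $N(N-1)$ rows, matching the caption's statement that the number of teacher-student pairs grows by a factor of $N-1$.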