Densely Distilling Cumulative Knowledge for Continual Learning

Zenglin Shi, Pei Liu, Tong Su, Yunpeng Wu, Kuien Liu, Yu Song, Meng Wang

TL;DR

This work tackles catastrophic forgetting in class-incremental continual learning by introducing Dense Knowledge Distillation (DKD), which uses a task pool to track capabilities and partitions the model's output logits into dense groups corresponding to tasks. DKD distills knowledge from all groups, with a practical random-group sampling strategy to reduce computational cost, and employs an adaptive weighting scheme based on old-new class counts and cross-class similarity. The approach yields two variants, Full Dense KD (FDKD) and Random Dense KD (RDKD), and demonstrates state-of-the-art performance across CIFAR100 and ImageNet-100/1000 benchmarks, along with robustness to memory budgets and task orders. Empirical results indicate improved model stability, flatter minima, and seamless compatibility with other continual learning methods and offline applications like model compression, highlighting DKD’s practical impact for scalable continual learning.

Abstract

Continual learning, involving sequential training on diverse tasks, often faces catastrophic forgetting. While knowledge distillation-based approaches exhibit notable success in preventing forgetting, we pinpoint a limitation in their ability to distill the cumulative knowledge of all the previous tasks. To remedy this, we propose Dense Knowledge Distillation (DKD). DKD uses a task pool to track the model's capabilities. It partitions the output logits of the model into dense groups, each corresponding to a task in the task pool. It then distills all tasks' knowledge using all groups. Because using all the groups can be computationally expensive, we also suggest random group selection in each optimization step. Moreover, we propose an adaptive weighting scheme, which balances the learning of new classes and the retention of old classes, based on the count and similarity of the classes. Our DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios. Empirical analysis shows that DKD enhances model stability, promotes flatter minima for improved generalization, and remains robust across various memory budgets and task orders. Moreover, it integrates seamlessly with other CL methods to boost performance and proves versatile in offline scenarios like model compression.
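
The group-wise distillation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`group_kd`, `dense_kd`), the temperature value, the KL-divergence form of the loss, and the contents of the example `task_pool` are all assumptions made for illustration, and the adaptive weighting scheme is omitted because its formula is not stated here. The sketch only shows the idea of slicing the logits into one group per task-pool entry and distilling either over all groups or over one randomly chosen group per step.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def group_kd(teacher_logits, student_logits, class_idx, temperature=2.0):
    """KL(teacher || student) over the logits of one group only.

    The temperature value and the KL form are assumptions for this sketch.
    """
    p = softmax(teacher_logits[:, class_idx] / temperature)
    q = softmax(student_logits[:, class_idx] / temperature)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)
    return float(kl.mean())

def dense_kd(teacher_logits, student_logits, task_pool, rng=None):
    """Average the group loss over every group in the pool.

    If `rng` is given, a single randomly selected group is used instead,
    mimicking random group selection in each optimization step.
    """
    if rng is not None:
        groups = [task_pool[rng.integers(len(task_pool))]]
    else:
        groups = task_pool
    losses = [group_kd(teacher_logits, student_logits, g) for g in groups]
    return float(np.mean(losses))

# Hypothetical task pool: two old tasks with two classes each, plus their union.
task_pool = [np.arange(0, 2), np.arange(2, 4), np.arange(0, 4)]

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 4))   # old model: logits for 4 old classes
student = rng.normal(size=(8, 6))   # new model: 4 old classes + 2 new classes

full_loss = dense_kd(teacher, student, task_pool)             # all groups
random_loss = dense_kd(teacher, student, task_pool, rng=rng)  # one random group
print(full_loss, random_loss)
```

In this sketch the full variant costs one softmax and KL term per group at every step, while the random variant costs a single term, which is the trade-off the abstract points to when it proposes random group selection.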

Paper Structure (14 sections, 5 equations, 5 figures, 6 tables)

This paper contains 14 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An illustrative example showcasing the superiority of our method. A model that has learned $\{T^1,T^2,T^3\}$ sequentially should be able to solve them individually and jointly, as shown by the gray region. Existing distillation methods transfer knowledge for either global (e.g., GKD) or individual tasks (e.g., TKD). In contrast, our DKD transfers cumulative knowledge for all tasks and prevents forgetting better.
  • Figure 2: Illustration of our approaches. When incrementally learning three classification tasks involving classes $\{C^0, C^1, C^2\}$ using a model $f$, a task pool $P$ monitors the tasks within the model's capabilities. The ideal capability of model $f$ includes recognizing task-specific classes individually and their combined classes. Our DKD facilitates the transfer of cumulative knowledge for recognizing both task-specific and combined classes, as indicated in the task pool, from earlier models to the new model.
  • Figure 3: Results on CIFAR100. In terms of both the average incremental accuracy and the accuracy on the initial task at each incremental step, for $T=5$ and $T=10$, our method consistently outperforms the compared methods at all stages. The minimal decline in accuracy on the initial task across incremental steps indicates reduced forgetting.
  • Figure 4: Enhancing stability and promoting flatter minima. (a) In comparison to GKD and TKD, our RDKD consistently achieves higher accuracy on old classes across all stages. This indicates that RDKD reinforces stability more effectively than GKD and TKD; (b) RDKD exhibits lower sensitivity to perturbations compared to GKD and TKD, which highlights RDKD's capability to achieve flatter minima.
  • Figure 5: Robustness to memory budget. RDKD consistently surpasses other baselines across varying memory sizes in terms of the average incremental accuracy on CIFAR100. As the memory size decreases, RDKD exhibits minimal accuracy degradation, showing robustness to the exemplar memory size.