Low-redundancy Distillation for Continual Learning

RuiQi Liu, Boyu Diao, Libo Huang, Zijia An, Hangda Liu, Zhulin An, Yongjun Xu

TL;DR

Low-redundancy Distillation (LoRD) is a novel CL method that improves model performance while maintaining training efficiency. It does so by eliminating redundancy in three aspects of CL: student model redundancy, teacher model redundancy, and rehearsal sample redundancy.

Abstract

Continual learning (CL) aims to learn new tasks without erasing previous knowledge. However, current CL methods primarily emphasize improving accuracy while often neglecting training efficiency, which restricts their practical application. Drawing inspiration from the brain's contextual gating mechanism, which selectively filters neural information and continuously updates past memories, we propose Low-redundancy Distillation (LoRD), a novel CL method that enhances model performance while maintaining training efficiency. This is achieved by eliminating redundancy in three aspects of CL: student model redundancy, teacher model redundancy, and rehearsal sample redundancy. By compressing the learnable parameters of the student model and pruning the teacher model, LoRD facilitates the retention and optimization of prior knowledge, effectively decoupling task-specific knowledge without manually assigning isolated parameters for each task. Furthermore, we optimize the selection of rehearsal samples and refine rehearsal frequency to improve training efficiency. Through a careful design of distillation and rehearsal strategies, LoRD balances training efficiency and model accuracy. Extensive experiments across various benchmark datasets and environments show that LoRD achieves the highest accuracy with the lowest training FLOPs.

Paper Structure

This paper contains 38 sections, 1 theorem, 18 equations, 7 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

For any positive integer $k$ with $\mathcal{B} < k$, the following equation holds:

Figures (7)

  • Figure 1: Overview of LoRD. We employ Sample Refining to achieve a more balanced sample selection, Student Model Compression to reduce forward and backward propagation FLOPs, and Teacher Model Pruning to minimize distillation parameters. Each of these techniques targets redundancy reduction, leading to improved accuracy and enhanced training efficiency.
  • Figure 2: The main framework of LoRD. Above the dashed line: During the training of each task, we prune the teacher model into a teacher subnet and distill the student subnet that has the same structure as the teacher subnet. Additionally, we propose Teacher-aware Reservoir Sampling to optimize the selection of replay samples. Below the dashed line: We compress the number of learnable parameters in the student model to reduce redundancy. At each task boundary, we assign these learnable parameters to the student model to ensure its plasticity.
  • Figure 3: Classification results for standard buffer-free CL benchmarks.
  • Figure 4: Results of different methods on S-CIFAR-100 with an unknown number of tasks and a buffer size of 500.
  • Figure 5: Results for the quantitative analysis. (a) and (b) illustrate the accuracy at varying rehearsal frequencies. (c) displays the training FLOPs and accuracy of different methods.
  • ...and 2 more figures
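
The caption of Figure 2 mentions Teacher-aware Reservoir Sampling for choosing replay samples. This summary does not describe the teacher-aware selection rule, so the sketch below shows only classic uniform reservoir sampling, the standard buffer-filling scheme that such a method presumably builds on. The function names `reservoir_update` and `fill_buffer` are hypothetical and are not taken from the paper.

```python
import random


def reservoir_update(buffer, capacity, num_seen, sample, rng=random):
    """Offer one stream sample to a fixed-size buffer (classic reservoir sampling).

    `num_seen` is the number of samples offered before this one.
    Returns the buffer index that was written, or None if the sample was dropped.
    NOTE: this is the plain uniform rule, not the paper's teacher-aware variant.
    """
    if len(buffer) < capacity:
        # Buffer not full yet: always keep the sample.
        buffer.append(sample)
        return len(buffer) - 1
    # Keep the new sample with probability capacity / (num_seen + 1).
    j = rng.randint(0, num_seen)  # uniform over 0..num_seen, both ends inclusive
    if j < capacity:
        buffer[j] = sample  # overwrite a uniformly chosen slot
        return j
    return None


def fill_buffer(stream, capacity, seed=0):
    """Run reservoir sampling over an iterable and return the final buffer."""
    rng = random.Random(seed)
    buffer = []
    for num_seen, sample in enumerate(stream):
        reservoir_update(buffer, capacity, num_seen, sample, rng)
    return buffer


if __name__ == "__main__":
    # A buffer of 500 items drawn from a stream of 10,000 sample ids.
    replay = fill_buffer(range(10_000), capacity=500)
    print(len(replay))
```

With this rule every sample in the stream ends up in the buffer with equal probability, which is why the number of tasks does not need to be known in advance. A teacher-aware variant would replace the uniform choice with a criterion that depends on the teacher model; that criterion is not given here.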

Theorems & Definitions (1)

  • Theorem 1