Towards Redundancy-Free Sub-networks in Continual Learning

Cheng Chen, Jingkuan Song, LianLi Gao, Heng Tao Shen

TL;DR

This work addresses catastrophic forgetting in continual learning by introducing the Information Bottleneck Masked sub-network (IBM). IBM uses an information-theoretic objective to prune redundancy within task-specific sub-networks, combining a weight-space variational formulation with masks derived from parameter statistics to construct redundancy-free sub-networks. It freezes essential weights to prevent forgetting, reuses the selected variational parameters and re-initializes the rest to promote knowledge transfer, and adds a feature-decomposition module that sets layer-wise pruning ratios automatically by analyzing hidden representations. Empirically, IBM achieves state-of-the-art performance on multiple benchmarks while reducing sub-network parameters by about 70% and training time by about 80%, and it shows strong capacity for longer task sequences thanks to reduced redundancy and better knowledge transfer.

Abstract

Catastrophic Forgetting (CF) is a prominent issue in continual learning. Parameter isolation addresses this challenge by masking a sub-network for each task to mitigate interference with old tasks. However, these sub-networks are constructed based on weight magnitude, which does not necessarily correspond to weight importance, so unimportant weights are retained and the resulting sub-networks are redundant. To overcome this limitation, inspired by the information bottleneck, which removes redundancy between adjacent network layers, we propose the Information Bottleneck Masked sub-network (IBM) to eliminate redundancy within sub-networks. Specifically, IBM accumulates valuable information into essential weights to construct redundancy-free sub-networks, which not only mitigates CF effectively by freezing the sub-networks but also facilitates training on new tasks through the transfer of valuable knowledge. Additionally, IBM decomposes hidden representations to automate the construction process and make it flexible. Extensive experiments demonstrate that IBM consistently outperforms state-of-the-art methods. Notably, IBM surpasses the state-of-the-art parameter isolation method with a 70% reduction in the number of parameters within sub-networks and an 80% decrease in training time.

Paper Structure (22 sections, 14 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Total number of parameters within sub-networks and average training time per task for WSN (Kang et al., 2022) and our IBM on CIFAR-100 with ResNet-18. Notably, our method surpasses WSN by 1.68% with a 70% reduction in the number of parameters and an 80% decrease in training time.
  • Figure 2: An overview of the proposed IBM. After training on Task T, a binary mask is constructed by selecting a subset of the variational parameters. The mask and the selected parameters are copied to a memory pool for inference. Before training on Task T+1, the selected parameters are kept and the remaining parameters are re-initialized to facilitate knowledge transfer. Finally, during training on Task T+1, the network weights indicated by the previously learned masks are frozen to prevent catastrophic forgetting.
  • Figure 3: Number of masked weights in each layer of the first sub-network constructed by WSN and IBM on CIFAR-100 with ResNet-18. The significant reduction in the number of weights supports the conjecture that the information bottleneck reduces redundancy within sub-networks.
  • Figure 4: Mean and standard deviation of the ablation results for the feature decomposition interval on CIFAR-100 with ResNet-18 as the backbone. We choose a 50-epoch interval as the main experimental setting to balance performance and efficiency.
  • Figure 5: Mean and standard deviation of the ablation results for the feature decomposition interval on the TinyImageNet and MiniImageNet datasets with ResNet-18 as the backbone.
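
The mask-save-reinitialize-freeze cycle described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the flat weight list, the `keep_ratio` value, and the importance scores are all hypothetical. In particular, the scores are hard-coded placeholders, since this sketch does not model how IBM derives its selection from variational parameters.

```python
import random


def build_mask(scores, keep_ratio):
    """Return a 0/1 mask that keeps the top `keep_ratio` fraction of entries by score."""
    k = max(1, round(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask


def reinit_unselected(weights, mask, rng, scale=0.1):
    """Keep the selected weights and redraw the others from a Gaussian (assumed init)."""
    return [w if m else rng.gauss(0.0, scale) for w, m in zip(weights, mask)]


def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """Plain SGD step that leaves frozen (previously masked) weights untouched."""
    return [w if f else w - lr * g for w, g, f in zip(weights, grads, frozen)]


rng = random.Random(0)

# Weights after training on Task T, with hypothetical importance scores.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
scores = [0.8, 0.1, 0.6, 0.05, 0.9, 0.2]

# 1) Build the binary mask and store it with the selected weights for inference.
mask_t = build_mask(scores, keep_ratio=0.5)
memory_pool = {"task_T": (mask_t, [w * m for w, m in zip(weights, mask_t)])}

# 2) Before Task T+1: keep the selected weights, re-initialize the rest.
weights_next = reinit_unselected(weights, mask_t, rng)

# 3) During Task T+1: weights covered by earlier masks receive no update.
grads = [1.0] * len(weights)  # dummy gradients for illustration
weights_after = masked_sgd_step(weights_next, grads, frozen=mask_t)

print(mask_t)
print(weights_after)
```

With more than two tasks, the `frozen` argument would be the element-wise union of all masks stored in `memory_pool`, so that every earlier sub-network stays intact.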