Overcoming Growth-Induced Forgetting in Task-Agnostic Continual Learning

Yuqing Zhao, Jiannong Cao, Divya Saxena, Xiaoyun Liu, Changlin Song, Bo Yuan, Julie McCann

TL;DR

Growth of model capacity in task-agnostic continual learning can trigger forgetting when the entire grown model is used for inference. The authors identify growth-induced forgetting and show existing growth strategies differ in forgetting risks, with layer expansion offering a path to reduce forgetting. They propose SparseGrow, combining layer expansion, gradient gating, and sparse training/initialization to enable targeted updates and controlled plasticity. Extensive experiments across domain- and class-incremental datasets demonstrate SparseGrow achieves high adaptability while minimizing forgetting, outperforming baselines with modest parameter overhead.

Abstract

In continual learning (CL), model growth enhances adaptability to new data. However, when model growth is applied improperly, especially in task-agnostic CL, where the entire grown model is used for inference, it can severely degrade learned knowledge, a problem we term growth-induced forgetting. Most existing methods that adopt model growth to improve adaptability overlook this forgetting issue, which compromises knowledge retention and makes them unsuitable for task-agnostic settings. To promote both adaptability and knowledge retention with model growth, we identify gradient and parameter sparsity as the key. We introduce SparseGrow, which increases gradient sparsity through layer expansion and gradient gating, enabling focused parameter updates while preserving critical parameters and thus inhibiting forgetting. Moreover, it promotes parameter sparsity through sparse initialization and training, aiming at better control of model plasticity and improved adaptability to new data. Extensive experiments across diverse datasets, task-agnostic settings, and a large number of tasks demonstrate the necessity of controlled layer expansion and validate the effectiveness of SparseGrow in achieving high adaptability while minimizing forgetting in continual learning. By enabling model growth with sparsified gradients and parameters, SparseGrow paves the way for building scalable lifelong learning systems capable of continual adaptation with better knowledge retention.
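
To make the idea of layer expansion with gradient gating more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It widens a single dense layer by appending output units and applies a binary gate so that only the new units receive updates. The function names `expand_layer` and `gated_sgd_step`, the zero initialization of new units, and the plain SGD update are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def expand_layer(weight, extra_units):
    """Widen a dense layer (shape: out x in) by appending new output units.

    Assumption: new units are zero-initialized, a simple stand-in for
    the sparse initialization mentioned in the abstract.
    Returns the expanded weight and a gate (1 = trainable, 0 = frozen).
    """
    old_out, n_in = weight.shape
    new_rows = np.zeros((extra_units, n_in), dtype=weight.dtype)
    expanded = np.vstack([weight, new_rows])
    gate = np.zeros_like(expanded)
    gate[old_out:, :] = 1.0  # only the newly added rows may change
    return expanded, gate

def gated_sgd_step(weight, grad, gate, lr=0.1):
    """Plain SGD step in which the gate zeroes gradients of frozen weights."""
    return weight - lr * (grad * gate)

rng = np.random.default_rng(0)
w_old = rng.normal(size=(3, 4))          # weights learned on earlier data
w, gate = expand_layer(w_old, extra_units=2)
grad = rng.normal(size=w.shape)          # hypothetical gradient on new data
w_new = gated_sgd_step(w, grad, gate)

print(w_new.shape)                       # expanded layer shape
print(np.array_equal(w_new[:3], w_old))  # old weights are untouched
```

Because the gate multiplies the gradient elementwise, the update is sparse by construction: previously learned rows stay exactly as they were, while the added capacity absorbs the new data.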

Paper Structure (25 sections, 12 equations, 8 figures, 4 tables, 2 algorithms)

Figures (8)

  • Figure 1: Comparison of three model growth strategies applied to ResNet in task-agnostic continual learning (CL) across sequential domains, along with a baseline without model growth (No Growth). The strategies include: Layer Expansion: Widening existing layers (e.g., DEN, RCL, DER) to increase gradient sparsity and resist forgetting. Lateral Connections: Adding new lateral layers (e.g., PNNs, VariGrow, P&C, REC) for model adaptability. In-Depth Growth: Increasing model depth via added hidden layers (e.g., REC, Kozal et al.) for enhanced adaptability. (a) shows the change in average accuracy across domains; (b) reports both average accuracy and backward transfer, a measure of forgetting. These results provide initial evidence for our study of growth-induced forgetting in task-agnostic CL. While Layer Expansion demonstrates superior accuracy and minimal forgetting by increasing gradient sparsity, other growth strategies may lead to higher degradation of past knowledge. This underscores the importance of controlled model growth in scenarios where the entire model is used for inference across evolving tasks. (Best viewed in color.)
  • Figure 2: SparseGrow training process overview. Due to the high complexity or dissimilarity of the new dataset (blue), the model's capacity limits its performance. Hence, the model needs to be expanded to enhance its adaptability and to reserve space for future data. Sparse training with freezing is used throughout the training process. The sparsity level and freeze mask are updated as the model expands.
  • Figure 3: Average accuracy fluctuation of different methods across an increasing number of observed domains. Methods include continual learning baselines, model growth techniques, and continual learning methods combined with layer expansion (LayExp). Rehearsal methods like LwF and PRE-DFKD initially decline when combined with LayExp, indicating that applying LayExp directly may be unsuitable and risks increased growth-induced forgetting; later, as the number of domains increases, the positive effects of LayExp outweigh the negative impact of growth-induced forgetting. SparseGrow excels in knowledge retention, and its advantage grows as the number of domains rises.
  • Figure 4: Epoch-wise average accuracy of observed domains using different continual learning methods with layer expansion on FreshStale datasets with six sequential domains.
  • Figure 5: Epoch-wise average accuracy of observed domains using different continual learning methods with layer expansion on DomainNet datasets with four sequential domains.
  • ...and 3 more figures
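
The average accuracy and backward transfer mentioned in the Figure 1 caption can be illustrated with a short sketch. It assumes the commonly used definitions, where `acc[i][j]` is the accuracy on domain `j` after training on domain `i`, average accuracy is the mean of the final row, and backward transfer is the mean change on each earlier domain between the moment it was learned and the end of training. The paper may define these metrics slightly differently, and the accuracy values below are hypothetical.

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all domains after training on the last domain."""
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())

def backward_transfer(acc):
    """Mean of (final accuracy - accuracy right after learning) over
    all domains except the last; negative values indicate forgetting."""
    acc = np.asarray(acc, dtype=float)
    num_domains = acc.shape[0]
    final = acc[-1, :num_domains - 1]
    just_learned = np.diag(acc)[:num_domains - 1]
    return float((final - just_learned).mean())

# Hypothetical accuracy matrix for three sequential domains.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]

avg = average_accuracy(acc)
bwt = backward_transfer(acc)
print(round(avg, 3), round(bwt, 3))
```

In this toy matrix the first domain drops from 0.90 to 0.70 and the second from 0.85 to 0.80, so backward transfer is negative, which is the signature of forgetting.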