Self-Controlled Dynamic Expansion Model for Continual Learning

Runqing Wu, Kaihui Huang, Hanyi Zhang, Fei Ye

TL;DR

This paper tackles catastrophic forgetting and limited plasticity in continual learning by introducing SCDEM, a self-controlled dynamic expansion model that uses multiple pre-trained ViT backbones to provide diverse representations and dynamically creates a lightweight task-specific expert for each new task. The approach integrates three mechanisms: the Collaborative Optimization Mechanism (COM), whose loss $\mathcal{L}_{\mathrm{COM}}$ aligns predictions across old and new backbones via KL divergence; Feature Distribution Consistency (FDC), which stabilizes semantic representations using the Wasserstein distance; and the Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM), which adapts the strength of regularization across layers. A task-free expert selection strategy enables class-incremental (class-IL) inference without task labels, and a four-step training procedure updates the new expert while preserving prior knowledge through backbone snapshots. Empirical results on multi-domain benchmarks show that SCDEM achieves state-of-the-art performance with improved efficiency over existing methods, highlighting its practical potential for scalable, memory-efficient continual learning in ViT-based systems.
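The TL;DR states that COM aligns predictions across old and new backbones with a KL divergence. A minimal NumPy sketch of such a consistency term follows, under assumptions that are ours rather than the paper's: each previous expert is run twice on the same batch, once on features from the frozen backbone snapshots (treated as the target) and once on features from the backbones being updated, and the loss averages KL(p_old || p_new) over samples and then over experts. The function name `com_kl_loss` is hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def com_kl_loss(old_logits, new_logits) -> float:
    """Hypothetical COM-style consistency loss (a sketch, not the authors' code).

    old_logits: list of (batch, classes) arrays, one per previous expert, computed
        from the frozen backbone snapshots and treated as fixed targets.
    new_logits: matching list computed from the backbones currently being trained.
    Returns KL(p_old || p_new), averaged over samples and then over experts.
    """
    if len(old_logits) != len(new_logits):
        raise ValueError("need one pair of logits per previous expert")
    if not old_logits:
        return 0.0  # first task: no previous experts to stay consistent with
    per_expert = []
    for old, new in zip(old_logits, new_logits):
        log_p = log_softmax(np.asarray(old, dtype=float))
        log_q = log_softmax(np.asarray(new, dtype=float))
        kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)  # per-sample KL
        per_expert.append(kl.mean())
    return float(np.mean(per_expert))
```

Because the softmax is invariant to adding the same constant to every logit, such a shift leaves this loss at zero; only changes in the relative class scores of an old expert are penalized.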

Abstract

Continual Learning (CL) is a training paradigm in which data from earlier tasks is no longer accessible when new tasks are learned. Many studies have leveraged a pre-trained Vision Transformer (ViT) to improve performance in continual learning. However, these approaches typically rely on a single, static backbone, which adapts poorly to new tasks, particularly across diverse data domains, because a substantial number of its parameters remain inactive. This paper addresses this limitation with a Self-Controlled Dynamic Expansion Model (SCDEM), which combines multiple distinct, trainable, pre-trained ViT backbones to provide diverse and semantically rich representations. Using the multi-backbone architecture as a shared module, SCDEM dynamically creates a new expert with minimal parameters for each new task. A Collaborative Optimization Mechanism (COM) jointly optimizes the backbones using prediction signals from previously trained experts, so that new tasks can be learned without erasing previously acquired knowledge. A Feature Distribution Consistency (FDC) approach aligns the semantic similarity between previously and currently learned representations through an optimal transport distance, mitigating negative knowledge transfer. To alleviate over-regularization, a Dynamic Layer-Wise Feature Attention Mechanism (DLWFAM) automatically determines the strength of the penalty applied to each trainable representation layer. Extensive experiments show that the approach achieves state-of-the-art performance.
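The abstract describes FDC as an optimal transport alignment between old and new representations and DLWFAM as a per-layer control of penalty strength; Figure 2(a) adds that layer-wise features are combined with attention weights $\{\alpha_k\}$ before the Wasserstein alignment. The sketch below shows one way to wire these together and is not the authors' implementation. Assumptions: the selector output is a plain vector of per-layer scores turned into weights with a softmax, the same weights are applied to the frozen snapshot's layers, and the exact Wasserstein distance is swapped for the sliced 2-Wasserstein approximation (random 1-D projections, where optimal transport reduces to sorting). All function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def fuse_layers(layer_feats: np.ndarray, scores) -> tuple:
    """Weight and sum layer-wise features.

    layer_feats: (L, n, d) features from L trainable layers for a batch of n samples.
    scores: (L,) per-layer scores standing in for the selector output (assumption).
    Returns the fused (n, d) features and the softmax weights alpha.
    """
    alpha = softmax(np.asarray(scores, dtype=float))
    fused = np.tensordot(alpha, layer_feats, axes=1)  # sum_k alpha_k * Z_k
    return fused, alpha


def sliced_wasserstein(x: np.ndarray, y: np.ndarray, n_proj: int = 64, seed: int = 0) -> float:
    """Sliced 2-Wasserstein distance between two (n, d) point clouds of equal size."""
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # random unit directions
    # In 1-D, optimal transport between equal-size samples pairs sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))


def fdc_loss(cur_layers: np.ndarray, frozen_layers: np.ndarray, scores) -> float:
    """Hypothetical FDC term: distance between current and frozen fused features."""
    fused, _ = fuse_layers(cur_layers, scores)
    fused_frozen, _ = fuse_layers(frozen_layers, scores)  # same weights (assumption)
    return sliced_wasserstein(fused, fused_frozen)
```

The sliced variant is used here only because it is cheap and needs no optimal transport solver; it requires both feature sets to come from the same batch so that they contain the same number of samples.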

Paper Structure

This paper contains 14 sections, 12 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Overview of the SCDEM training framework. (a) Initial task stage: (i) Each backbone $f_{\theta_j}$ is partially fine-tuned to extract multi-source features ${\bf z}^f$, which are used to train a task-specific expert $\mathcal{E}_t = \{f_{\xi_t}, f_{\omega_t}\}$. (ii) Backbone copies $\hat{f}_{\theta_j}$ are frozen to retain prior knowledge. (b) Continual learning stage: (iii) A selector $g_{\phi_t}$ assigns layer-wise weights to compute ${\bf Z}^{\text{fused}}_j$, aligned with its frozen counterpart via Wasserstein distance. (iv) Knowledge consistency is enforced through KL divergence between expert outputs ($\mathcal{L}_{\rm COM}$), and task-specific supervision is applied via cross-entropy loss ($\mathcal{L}_{\rm CE}$).
  • Figure 2: (a) Selector-weighted fusion (DLWFAM): layer-wise features from $f_{\theta_j}$ are aggregated via attention weights $\{\alpha_k\}$ to form ${\bf Z}^{\text{fused}}_j$, aligned with the frozen $\hat{\bf Z}^{\text{fused}}_j$ via Wasserstein distance. (b) Task-free expert selection: each expert is scored by combining prediction entropy and KL divergence between its log-likelihood and a global softmax distribution, enabling class-IL inference without task labels.
  • Figure 3: (a) and (b) illustrate the feature distributions of the final layer from the dual-backbone network using t-SNE and UMAP, respectively. (c) and (d) compare the cosine distance statistics between the output features and the baseline.
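Figure 2(b) scores each expert by combining prediction entropy with a KL divergence between the expert's log-likelihood and a global softmax distribution. This summary does not define the global distribution, so the sketch below assumes that every expert emits logits over one shared label space and that the global distribution is the softmax of the experts' mean logits; the weight `lam` and the rule that the lowest score wins are likewise assumptions, and `select_expert` is a hypothetical name.

```python
import numpy as np


def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def select_expert(expert_logits, lam: float = 1.0):
    """Pick an expert for one test sample without a task label (a sketch).

    expert_logits: (E, C) array, one row of logits per expert, all over the same
        C classes (assumption). Returns (chosen index, per-expert scores).
    """
    logits = np.asarray(expert_logits, dtype=float)
    logp = log_softmax(logits)                   # (E, C) per-expert log-likelihoods
    p = np.exp(logp)
    log_g = log_softmax(logits.mean(axis=0))     # (C,) hypothetical global distribution
    entropy = -(p * logp).sum(axis=1)            # (E,) prediction entropy
    kl = (p * (logp - log_g)).sum(axis=1)        # (E,) KL(p_e || global)
    scores = entropy + lam * kl                  # lower = more confident and consistent
    return int(np.argmin(scores)), scores
```

With one confident expert and two uninformative ones, for example `select_expert([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])`, the confident expert has both the lowest entropy and the smallest divergence from the mean-logit distribution, so index 0 is selected.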