Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model

Runqing Wu, Fei Ye, Qihe Liu, Guoxi Huang, Jinyu Guo, Rongyao Hu

TL;DR

This work tackles multi-domain continual learning by introducing MSDEM, a framework that leverages multiple pre-trained Vision Transformer backbones to form dynamic experts for new tasks. It has two key components: a Dynamic Expandable Attention Mechanism (DEAM), which selectively gates knowledge from the backbones for each task, and a Dynamic Graph Weight Router (DGWR), which reuses prior experts through a learnable graph router to maximize transfer while mitigating forgetting. In experiments on cross-domain datasets, MSDEM achieves state-of-the-art average performance with fewer parameters than competitive baselines, demonstrating strong generalization across domain shifts and class increments. The approach offers a practical pathway to scalable, efficient continual learning in heterogeneous data environments by reusing diverse pre-trained knowledge sources and adapting only task-specific modules.

Abstract

Continual Learning seeks to develop a model capable of incrementally assimilating new information while retaining prior knowledge. However, current research predominantly addresses a straightforward learning context in which all data samples originate from a single data domain. This paper shifts focus to a more complex and realistic learning environment, characterized by data samples sourced from multiple distinct domains. We tackle this learning challenge by introducing a novel methodology, termed the Multi-Source Dynamic Expansion Model (MSDEM), which leverages various pre-trained models as backbones and progressively establishes new experts based on them to adapt to emerging tasks. Additionally, we propose a dynamic expandable attention mechanism designed to selectively harness knowledge from multiple backbones, thereby accelerating new-task learning. Moreover, we introduce a dynamic graph weight router that strategically reuses all previously acquired parameters and representations for new-task learning, maximizing positive knowledge transfer and further improving generalization performance. We conduct a comprehensive series of experiments, and the empirical findings indicate that our proposed approach achieves state-of-the-art performance.

Paper Structure

This paper contains 13 sections, 11 equations, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overall framework of the proposed method. (a) During the initial training task, samples from task $T_1$ are fed to multiple backbones, whose individual feature outputs are concatenated into a fused feature vector. This fused vector passes through a multi-head attention module for feature integration and then a classifier head that produces the final prediction. (b) In subsequent training tasks across multiple domains, the attention modules from previous tasks are frozen and retained as experts. The output feature vectors from all experts are fused and fed into a router, which applies weight allocation and Top-k selection to identify the most important experts for knowledge integration. The resulting fused vector is then processed through graph attention for the final prediction.
  • Figure 2: The expert selection process with different values of $\tau$. (a) When the temperature is low, the selection approaches a one-hot vector, selecting only the expert for the current task. (b) When the temperature is set close to 1, it performs Top-k selection, where $k$ is learned during training rather than manually constrained. (c) When the temperature is high, all experts are selected.
  • Figure 3: In the T-C100-B-C10 task configuration, we compare the forgetting curves of MSDEM against SOTA methods. An additional 3-epoch MSDEM variant is included for comparison with StarPrompt. Our model consistently outperforms all SOTA approaches across all tasks.
  • Figure 4: (a) Feature maps of trained experts across four domains, reduced via PCA. Most domains exhibit mutual independence, with the exception of the relatively close alignment between TinyImageNet and Birds; (b) Heatmap showing the knowledge weights of domain2 on domain1 across 16 permutation schemes, with each domain’s self-dependency set to 1.
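
The temperature behaviour described in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's router: it assumes the router weights come from a temperature-scaled softmax over per-expert scores, and it stands in for the learned Top-k with a fixed weight cutoff. The names `temperature_softmax`, `select_experts`, `threshold`, and the example scores are all hypothetical.

```python
import math


def temperature_softmax(scores, tau):
    """Softmax over expert scores; small tau sharpens, large tau flattens."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(scores, tau, threshold=0.1):
    """Return indices of experts whose weight reaches the cutoff.

    Assumption: a fixed cutoff replaces the learned k from the caption.
    """
    weights = temperature_softmax(scores, tau)
    return [i for i, w in enumerate(weights) if w >= threshold]


# Hypothetical router scores; the last entry plays the current task's expert.
scores = [0.2, 0.5, 1.0, 3.0]

low = select_experts(scores, tau=0.1)    # near one-hot -> [3]
mid = select_experts(scores, tau=1.0)    # a subset of experts -> [2, 3]
high = select_experts(scores, tau=50.0)  # near uniform -> [0, 1, 2, 3]
print(low, mid, high)
```

With these illustrative scores, a low temperature keeps only the current task's expert, a temperature of 1 keeps a subset, and a high temperature keeps every expert, matching the three regimes (a) to (c) in the caption.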