Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks
Yuliang Cai, Mohammad Rostami
TL;DR
This work tackles continual learning for Vision-and-Language tasks using a transformer backbone. It introduces TAM-CL, a dynamically expanding architecture that adds per-task attention tokens and task-specific heads, aided by intermediate knowledge distillation from a frozen teacher and experience replay to mitigate catastrophic forgetting. The method demonstrates state-of-the-art performance across five multimodal datasets and shows robustness to task order and hyper-parameter settings, while maintaining low memory and time overhead. The approach advances practical multimodal continual learning by enabling cross-task knowledge transfer on a shared transformer, with potential applicability to edge devices.
Abstract
Transformer neural networks are increasingly replacing prior architectures in a wide range of applications across data modalities. The growing size and computational cost of fine-tuning large pre-trained transformers pose significant challenges for adopting these models in applications that demand on-edge computing. Continual learning (CL) addresses this challenge by facilitating the transfer of knowledge across tasks that arrive sequentially for an autonomously learning agent. However, current CL methods mainly focus on learning tasks that are exclusively vision-based or language-based. We propose a transformer-based CL framework for learning tasks that involve both vision and language, known as Vision-and-Language (VaL) tasks. Given the success of transformers in other modalities, our architecture has the potential to be used in other multimodal learning settings. Our framework introduces extra parameters to a base transformer to specialize the network for each task, which enables dynamic model expansion to learn several tasks in a sequence. We also use knowledge distillation to benefit from relevant past experiences and learn the current task more efficiently. Our proposed method, Task Attentive Multimodal Continual Learning (TAM-CL), allows for the exchange of information between tasks while mitigating the problem of catastrophic forgetting. Notably, our approach is scalable, incurring minimal memory and time overhead. TAM-CL achieves state-of-the-art (SOTA) performance on challenging multimodal tasks.
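To make the two mechanisms named above more concrete, the following is a minimal illustrative sketch of dynamic expansion (one extra token and one head per task on a shared backbone) and of feature-level distillation against a frozen copy of the earlier model. It is not the TAM-CL implementation: the class `ExpandableModel`, the function `distillation_loss`, the element-wise "backbone", and the choice of mean squared error are all assumptions made for illustration, and experience replay is omitted.

```python
import copy
import random


class ExpandableModel:
    """Toy stand-in for a shared transformer plus per-task parameters (hypothetical)."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        # Shared weights: one scale per feature instead of a real transformer.
        self.shared = [1.0] * dim
        self.task_tokens = {}  # task name -> token vector
        self.task_heads = {}   # task name -> weight matrix (num_classes x dim)

    def add_task(self, name, num_classes):
        # Dynamic expansion: only a token and a head are added for a new task.
        self.task_tokens[name] = [self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
        self.task_heads[name] = [
            [self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
            for _ in range(num_classes)
        ]

    def features(self, x, name):
        # Intermediate representation: shared transform shifted by the task token.
        token = self.task_tokens[name]
        return [w * xi + t for w, xi, t in zip(self.shared, x, token)]

    def logits(self, x, name):
        h = self.features(x, name)
        return [sum(w * hi for w, hi in zip(row, h)) for row in self.task_heads[name]]


def distillation_loss(student, teacher, x, old_tasks):
    # Assumed form: MSE between student and frozen-teacher features on old tasks.
    total, count = 0.0, 0
    for name in old_tasks:
        hs = student.features(x, name)
        ht = teacher.features(x, name)
        total += sum((a - b) ** 2 for a, b in zip(hs, ht))
        count += len(hs)
    return total / count if count else 0.0


model = ExpandableModel(dim=4)
model.add_task("task_a", num_classes=3)
teacher = copy.deepcopy(model)           # frozen snapshot taken before the next task
model.add_task("task_b", num_classes=2)  # expansion leaves task_a parameters in place

x = [0.5, -1.0, 2.0, 0.0]
loss_before = distillation_loss(model, teacher, x, ["task_a"])
model.shared[0] = 1.5                    # simulate drift in the shared weights
loss_after = distillation_loss(model, teacher, x, ["task_a"])
```

In this toy setting the distillation term is zero until the shared weights drift away from the teacher's, at which point it grows and would penalize changes that alter the representations of earlier tasks.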
