Dynamic Transformer Architecture for Continual Learning of Multimodal Tasks

Yuliang Cai, Mohammad Rostami

TL;DR

This work tackles continual learning for Vision-and-Language tasks using a transformer backbone. It introduces TAM-CL, a dynamically expanding architecture that adds per-task attention tokens and task-specific heads, aided by intermediate knowledge distillation from a frozen teacher and experience replay to mitigate catastrophic forgetting. The method demonstrates state-of-the-art performance across five multimodal datasets and shows robustness to task order and hyper-parameter settings, while maintaining low memory and time overhead. The approach advances practical multimodal continual learning by enabling cross-task knowledge transfer on a shared transformer, with potential applicability to edge devices.

Abstract

Transformer neural networks are increasingly replacing prior architectures in a wide range of applications in different data modalities. The increasing size and computational demands of fine-tuning large pre-trained transformer neural networks pose significant challenges for the widespread adoption of these models for applications that demand on-edge computing. To tackle this challenge, continual learning (CL) emerges as a solution by facilitating the transfer of knowledge across tasks that arrive sequentially for an autonomously learning agent. However, current CL methods mainly focus on learning tasks that are exclusively vision-based or language-based. We propose a transformer-based CL framework focusing on learning tasks that involve both vision and language, known as Vision-and-Language (VaL) tasks. Due to the success of transformers in other modalities, our architecture has the potential to be used in multimodal learning settings. In our framework, we benefit from introducing extra parameters to a base transformer to specialize the network for each task. As a result, we enable dynamic model expansion to learn several tasks in a sequence. We also use knowledge distillation to benefit from relevant past experiences to learn the current task more efficiently. Our proposed method, Task Attentive Multimodal Continual Learning (TAM-CL), allows for the exchange of information between tasks while mitigating the problem of catastrophic forgetting. Notably, our approach is scalable, incurring minimal memory and time overhead. TAM-CL achieves state-of-the-art (SOTA) performance on challenging multimodal tasks.

Paper Structure

This paper contains 29 sections, 12 equations, 2 figures, 9 tables, 2 algorithms.

Figures (2)

  • Figure 1: The proposed CL training procedure: (1) A small portion of the data from previous tasks is randomly selected and stored in a memory buffer. (2) The current task arrives with $\mathcal{D}^i$. (3) The training data $\mathcal{D}^i$ is used as input to the teacher model to compute the distillation loss. (4) The memory buffer samples are replayed along with the current task data to train the main model. (5) After learning the current task, the teacher model of the next task will be a copy of the current model.
  • Figure 2: The proposed transformer-based architecture: (left) The VaL inputs are converted into two sequences and then fed into the self-attention layers to generate a fused global feature vector. The data feature vector is then concatenated with the learnable task-specific tokens and fed into the task attention layer to generate the input for the task-specific classifier heads. The same VaL inputs are also fed into the teacher model's transformer architecture to compute the knowledge distillation loss. (right) The task-attention block architecture.
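
The five steps in the Figure 1 caption can be read as a single training loop. The following is a minimal, dependency-free sketch of that loop under strong simplifying assumptions, not the authors' implementation: `ToyModel` is a hypothetical scalar stand-in for the transformer, `sgd_step`, `train_continually`, `buffer_fraction`, and `lam` are illustrative names and values, and a squared-error penalty on one intermediate feature stands in for the distillation loss. The numbered comments refer to the steps in the caption.

```python
import copy
import random


class ToyModel:
    """Hypothetical stand-in for the transformer: a shared weight plus one head per task."""

    def __init__(self):
        self.shared = 0.5
        self.heads = {}  # task_id -> head weight, grown as tasks arrive

    def add_task(self, task_id):
        self.heads[task_id] = 1.0  # dynamic expansion: one new head per task

    def feature(self, x):
        return self.shared * x  # plays the role of the intermediate feature

    def predict(self, x, task_id):
        return self.heads[task_id] * self.feature(x)


def sgd_step(model, teacher, sample, lr=0.01, lam=0.5):
    """One squared-error update with an optional feature-distillation penalty."""
    x, y, task_id = sample
    feat = model.feature(x)
    err = model.heads[task_id] * feat - y
    grad_shared = 2 * err * model.heads[task_id] * x
    grad_head = 2 * err * feat
    if teacher is not None:
        # Keep the student's feature close to the frozen teacher's feature.
        grad_shared += lam * 2 * (feat - teacher.feature(x)) * x
    model.shared -= lr * grad_shared
    model.heads[task_id] -= lr * grad_head


def train_continually(tasks, buffer_fraction=0.1, epochs=5, seed=0):
    rng = random.Random(seed)
    model, teacher, memory = ToyModel(), None, []
    for task_id, data in tasks:  # (2) the current task arrives with its data
        model.add_task(task_id)
        current = [(x, y, task_id) for x, y in data]
        for _ in range(epochs):
            batch = current + memory  # (4) replay the buffer with current data
            rng.shuffle(batch)
            for sample in batch:
                # (3) only current-task inputs go through the frozen teacher
                use_teacher = teacher if sample[2] == task_id else None
                sgd_step(model, use_teacher, sample)
        keep = max(1, int(buffer_fraction * len(current)))
        memory.extend(rng.sample(current, keep))  # (1) store a small random subset
        teacher = copy.deepcopy(model)  # (5) next teacher is a copy of the model
    return model, teacher, memory


# Two made-up regression "tasks" standing in for VaL datasets.
tasks = [
    ("task_a", [(i / 10, 2 * i / 10) for i in range(20)]),
    ("task_b", [(i / 10, -i / 10) for i in range(20)]),
]
model, teacher, memory = train_continually(tasks)
```

In this sketch the teacher is never updated after it is copied, so it acts as a fixed reference while the shared weight adapts to the new task, and the memory buffer stays a small fraction of the data seen so far.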