Continual Learning via Learning a Continual Memory in Vision Transformer

Chinmay Savadikar, Michelle Dai, Tianfu Wu

TL;DR

This work tackles task-incremental continual learning (TCL) for Vision Transformers (ViTs) by introducing CHEEM, a method that learns a task-synergy memory placed at the output projection layer after multi-head self-attention (MHSA) in the ViT block. CHEEM updates this memory with four operations (Reuse, Adapt, New, Skip) using a hierarchical exploration-exploitation sampling based neural architecture search (HEE NAS), so the memory grows in a structured, task-aware way across streaming tasks. On the Visual Domain Decathlon (VDD) and 5-Dataset benchmarks, CHEEM is reported to achieve state-of-the-art average accuracy and reduced forgetting compared to baselines such as L2G, L2P, SupSup, EFT, and LL, with modest compute and parameter overhead. The results support dynamic ViT backbones guided by memory-based task synergies as a practical route to TCL in vision systems. Future work will address inferring the task index at test time, a step toward class-incremental learning and more flexible deployment.

Abstract

This paper studies task-incremental continual learning (TCL) using Vision Transformers (ViTs). Our goal is to improve overall performance across streaming tasks without catastrophic forgetting by learning task synergies (e.g., a new task learns to automatically reuse or adapt modules from previous similar tasks, to introduce new modules when needed, or to skip some modules when it appears to be an easier task). One grand challenge is how to tame ViTs on diverse streaming tasks, balancing their plasticity and stability in a task-aware way while overcoming catastrophic forgetting. To address this challenge, we propose a simple yet effective approach that identifies a lightweight yet expressive "sweet spot" in the ViT block as the task-synergy memory in TCL. We present a Hierarchical task-synergy Exploration-Exploitation (HEE) sampling based neural architecture search (NAS) method that learns task synergies by structurally updating the identified memory component with four basic operations (reuse, adapt, new and skip) as tasks stream in. The proposed method is thus dubbed CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). In experiments, we test the proposed CHEEM on the challenging Visual Domain Decathlon (VDD) benchmark and the 5-Dataset benchmark. It obtains consistently better performance than the prior art, with sensible CHEEM structures learned continually.
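The four operations can be illustrated with a minimal, dependency-free sketch of the memory for a single ViT block. This is not the authors' implementation. The class `BlockMemory`, its methods, and the way each operation is modeled are all hypothetical: New creates a fresh projection, Reuse points at an earlier task's branch, Adapt adds a residual layer on top of a reused branch (a single square layer stands in for whatever lightweight adapter is actually used), and Skip contributes nothing so that only the residual path remains. Random weights stand in for trained ones.

```python
import random
from dataclasses import dataclass, field


def linear(weight, bias, x):
    """Apply y = W x + b to a plain Python list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_linear(dim, rng):
    """Create a small random square layer; a stand-in for a trained layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return weight, [0.0] * dim


@dataclass
class BlockMemory:
    """Hypothetical task-synergy memory for ONE block's output projection."""
    dim: int
    rng: random.Random
    projections: dict = field(default_factory=dict)  # task -> (weight, bias), added by "new"
    adapters: dict = field(default_factory=dict)     # task -> (weight, bias), added by "adapt"
    plan: dict = field(default_factory=dict)         # task -> (operation, source task)

    def add_task(self, task, op, source=None):
        if op not in ("reuse", "adapt", "new", "skip"):
            raise ValueError(f"unknown operation: {op}")
        if op in ("reuse", "adapt") and source not in self.plan:
            raise ValueError("reuse/adapt need an earlier task as source")
        if op == "new":
            self.projections[task] = random_linear(self.dim, self.rng)
        if op == "adapt":
            self.adapters[task] = random_linear(self.dim, self.rng)
        self.plan[task] = (op, source)

    def branch(self, task, attn_out):
        """Output of the memory for `task`, given the attention output."""
        op, source = self.plan[task]
        if op == "skip":   # assumption: a skipped branch contributes nothing
            return [0.0] * self.dim
        if op == "new":
            return linear(*self.projections[task], attn_out)
        reused = self.branch(source, attn_out)  # earlier task's parameters, untouched
        if op == "reuse":
            return reused
        # assumption: adapt = reused output plus a task-specific residual layer
        delta = linear(*self.adapters[task], reused)
        return [r + d for r, d in zip(reused, delta)]

    def apply(self, task, residual, attn_out):
        """Residual connection around the attention branch."""
        return [r + b for r, b in zip(residual, self.branch(task, attn_out))]


memory = BlockMemory(dim=4, rng=random.Random(0))
memory.add_task("task1", "new")
memory.add_task("task2", "reuse", source="task1")
memory.add_task("task3", "adapt", source="task1")
memory.add_task("task4", "skip")

x = [1.0, 2.0, 3.0, 4.0]       # block input (residual path)
attn = [0.5, -0.5, 0.25, 0.0]  # made-up attention output
outputs = {t: memory.apply(t, x, attn) for t in memory.plan}
```

In this sketch earlier tasks' parameters are never modified when a later task is added, which is one simple way to keep old tasks from being forgotten. How the operation for each block is actually chosen (the HEE-based NAS) is not modeled here.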

Paper Structure

This paper contains 35 sections, 7 equations, 17 figures, and 15 tables.

Figures (17)

  • Figure 1: Comparison of different continual learning methods using Vision Transformers [vit, attention-is-all-you-need]. (a) Prompt-based methods [learning-to-prompt, dualprompt, s-prompts, coda-prompt] leverage a pretrained, frozen Transformer and learn task-specific prompts. (b) Parameter-tuning based methods introduce task-specific layer-wise parameters on top of a pretrained, frozen Transformer, and differ in how the pretrained layer and the task-specific layer are fused, e.g., the parameter-masking methods [supsup, meta-attention, piggyback] and the output-addition methods [ll, clr, pool-of-adapters, nettailor]. They often introduce task-specific layers at every pretrained layer. (c) Our proposed method uses four operations to sequentially and continually maintain the task-synergy memory across streaming tasks: Reuse, New, Adapt and Skip.
  • Figure 2: Illustration of task-incremental continual learning on the Visual Domain Decathlon (VDD) benchmark [vdd], which consists of 10 tasks whose numbers of training images and categories vary significantly. As commonly adopted in the literature, we assume the first (base) task has sufficient data to train a base model, for which we use ImageNet-1k in our experiments.
  • Figure 3: Illustration of the proposed CHEEM. Left: A Transformer block in ViTs [vit] with the proposed CHEEM placed at the original output projection layer after the MHSA. Middle: The CHEEM is maintained by four operations. Right: An example of learned CHEEM for different tasks (e.g., $j$) starting from $i$.
  • Figure 4: An example of the CHEEM learned on the VDD benchmark [vdd] with the task sequence shown in Fig. 2. S, R, A and N represent Skip, Reuse, Adapt and New respectively. Starting from the ImageNet-trained ViT [vit] (B1 to B12 in Tsk1_ImNet), sensible structures are continually learned for the subsequent 9 tasks. The last two columns show the number of new task-specific parameters and the added FLOPs respectively, relative to the model for the first task (ImNet). See the text for details and the Appendix for more examples. Best viewed in magnification.
  • Figure 5: Illustration of CHEEM learning via NAS.
  • ...and 12 more figures
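
The per-task structures in Figure 4 are written as one letter per block (S, R, A, N), with a column reporting the new task-specific parameters. A rough sketch of that bookkeeping follows. The function name, the example structure string, and the adapter shape (a bottleneck MLP of width `adapter_hidden`) are hypothetical assumptions and are not taken from the paper; only the rule that Reuse and Skip add no parameters while New adds a full projection layer follows from the description of the operations.

```python
from collections import Counter

VALID_OPS = {"S", "R", "A", "N"}  # Skip, Reuse, Adapt, New


def new_task_parameters(structure, dim, adapter_hidden):
    """Count parameters a task adds, given one operation letter per block.

    Assumptions (hypothetical, for illustration only):
      - New adds a full dim x dim projection with bias.
      - Adapt adds a bottleneck MLP dim -> adapter_hidden -> dim with biases.
      - Reuse and Skip add nothing.
    """
    unknown = set(structure) - VALID_OPS
    if unknown:
        raise ValueError(f"unknown operation letters: {sorted(unknown)}")
    projection = dim * dim + dim
    adapter = dim * adapter_hidden + adapter_hidden + adapter_hidden * dim + dim
    cost = {"S": 0, "R": 0, "A": adapter, "N": projection}
    counts = Counter(structure)
    total = sum(cost[op] * n for op, n in counts.items())
    return counts, total


# Made-up 12-block structure (one letter per block, B1 to B12); not from Figure 4.
example_structure = "RRRARRNRSRAR"
op_counts, added = new_task_parameters(example_structure, dim=768, adapter_hidden=64)
print(dict(op_counts), added)
```

Under these assumptions, a task that mostly reuses earlier blocks adds only a small fraction of the backbone's parameters, which is the kind of trade-off the last two columns of Figure 4 report.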