Continual Learning via Learning a Continual Memory in Vision Transformer
Chinmay Savadikar, Michelle Dai, Tianfu Wu
TL;DR
This work tackles task-incremental continual learning (TCL) for Vision Transformers (ViTs) by introducing CHEEM, a method that learns a task-synergy memory placed at the output projection layer that follows multi-head self-attention (MHSA) in the ViT block. CHEEM updates this memory with four operations (Reuse, Adapt, New, Skip), selected by a hierarchical exploration-exploitation (HEE) sampling based neural architecture search (NAS), so that the memory grows in a structured, task-aware way as tasks stream in. On the Visual Domain Decathlon (VDD) and the 5-Dataset benchmark, CHEEM achieves state-of-the-art average accuracy and less forgetting than baselines such as L2G, L2P, SupSup, EFT, and LL, at a modest cost in compute and parameters. The results show that a dynamic ViT backbone guided by memory-based task synergies is viable for TCL. Future work will address inferring the task index at test time, a step toward class-incremental learning and more flexible deployment.
Abstract
This paper studies task-incremental continual learning (TCL) using Vision Transformers (ViTs). Our goal is to improve overall performance on streaming tasks without catastrophic forgetting by learning task synergies: a new task learns to automatically reuse or adapt modules from previous similar tasks, to introduce new modules when needed, or to skip some modules when it appears to be an easier task. One grand challenge is how to tame ViTs on streams of diverse tasks, balancing plasticity and stability in a task-aware way while overcoming catastrophic forgetting. To address this challenge, we propose a simple yet effective approach that identifies a lightweight yet expressive "sweet spot" in the ViT block to serve as the task-synergy memory in TCL. We present a Hierarchical task-synergy Exploration-Exploitation (HEE) sampling based neural architecture search (NAS) method that learns task synergies by structurally updating the identified memory component with four basic operations (reuse, adapt, new, and skip) as tasks stream in. The proposed method is thus dubbed CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). In experiments, we test CHEEM on the challenging Visual Domain Decathlon (VDD) benchmark and the 5-Dataset benchmark. It consistently outperforms the prior art, and the CHEEM structures learned continually are sensible.
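To make the four operations concrete, the following is a minimal, illustrative sketch of how a per-block task-synergy memory could route each task to a module. It is not the authors' implementation: the class `LayerMemory`, its fields, and the scalar-multiply stand-in for a projection layer are all hypothetical, the Skip operation is modeled here as an identity map, and the operation for each task is passed in by hand rather than chosen by the HEE-based NAS.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional

Vector = List[float]


class Op(Enum):
    REUSE = "reuse"
    ADAPT = "adapt"
    NEW = "new"
    SKIP = "skip"


@dataclass
class LayerMemory:
    """Hypothetical memory for one ViT block: a growing pool of modules
    plus a routing table that maps each task id to a module index."""
    modules: List[Callable[[Vector], Vector]] = field(default_factory=list)
    route: Dict[int, Optional[int]] = field(default_factory=dict)  # None means skip

    def add_task(self, task: int, op: Op,
                 source_task: Optional[int] = None, scale: float = 1.0) -> None:
        if op is Op.SKIP:
            self.route[task] = None
            return
        if op is Op.NEW:
            # Toy stand-in for a fresh projection layer: multiply by `scale`.
            self.modules.append(lambda x, s=scale: [s * v for v in x])
            self.route[task] = len(self.modules) - 1
            return
        # REUSE and ADAPT both need an earlier task that owns a module.
        src = self.route.get(source_task) if source_task is not None else None
        if src is None:
            raise ValueError("REUSE/ADAPT need a source task that has a module")
        if op is Op.REUSE:
            # Share the earlier module unchanged; no new parameters.
            self.route[task] = src
        else:
            # ADAPT: keep the earlier module frozen and add a light correction.
            base = self.modules[src]
            self.modules.append(
                lambda x, b=base, s=scale: [y + s * v for y, v in zip(b(x), x)]
            )
            self.route[task] = len(self.modules) - 1

    def forward(self, task: int, x: Vector) -> Vector:
        idx = self.route[task]
        return list(x) if idx is None else self.modules[idx](x)


mem = LayerMemory()
mem.add_task(0, Op.NEW, scale=2.0)
mem.add_task(1, Op.REUSE, source_task=0)
mem.add_task(2, Op.ADAPT, source_task=0, scale=0.5)
mem.add_task(3, Op.SKIP)
```

Because earlier modules are never modified in this sketch, adding tasks 1 to 3 leaves the output for task 0 unchanged, which is the sense in which such a structure avoids forgetting; only New and Adapt add to the module pool.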
