Exemplar Masking for Multimodal Incremental Learning
Yi-Lun Lee, Chen-Yu Lee, Wei-Chen Chiu, Yi-Hsuan Tsai
TL;DR
This work tackles multimodal class-incremental learning (MCIL) under tight memory and computation budgets by introducing an exemplar masking framework. The framework keeps only the discriminative tokens of each image and text exemplar, selected using class-token attention and the correlation between the two modalities. Because each masked exemplar takes less storage, more old-class exemplars fit within the same memory budget. A parameter-efficient tuning approach (SSF) reduces the cost of fine-tuning, and a multimodal data augmentation strategy diversifies the replayed exemplars. The authors extend ImageNet-R into a multimodal dataset, MM-ImageNet-R, by generating captions with InstructBLIP, then demonstrate improved accuracy and robustness to forgetting on MM-ImageNet-R and UPMC Food-101 across multiple incremental phases. The findings indicate substantial memory savings (fewer stored tokens per exemplar) and stronger replay performance, offering a scalable way to deploy large multimodal models in continual learning settings.
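As a rough illustration of the parameter-efficient tuning mentioned above, the sketch below assumes that SSF refers to scale-and-shift feature tuning, in which the backbone stays frozen and only a per-channel scale and shift applied to its features are learned. The class name `ScaleShift` and its methods are hypothetical and are not taken from the authors' code.

```python
# Minimal sketch of a scale-and-shift feature layer (assumed meaning of SSF).
# Hypothetical names; plain Python lists stand in for feature tensors.

class ScaleShift:
    """Applies y = gamma * x + beta independently to each feature channel."""

    def __init__(self, dim):
        # Identity initialisation: the layer initially leaves features unchanged.
        self.gamma = [1.0] * dim
        self.beta = [0.0] * dim

    def __call__(self, features):
        # features: list of token vectors, each of length dim.
        return [
            [g * x + b for x, g, b in zip(token, self.gamma, self.beta)]
            for token in features
        ]

    def num_parameters(self):
        # Only the scale and shift vectors would be trained.
        return len(self.gamma) + len(self.beta)


layer = ScaleShift(dim=2)
frozen_features = [[1.0, 2.0], [3.0, 4.0]]
identity_out = layer(frozen_features)

# Pretend training has updated the two small vectors.
layer.gamma[0] = 2.0
layer.beta[1] = 0.5
tuned_out = layer(frozen_features)
```

The point of the design is that the number of trainable values grows with the feature dimension rather than with the size of the backbone.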
Abstract
Multimodal incremental learning must digest information from multiple modalities while learning new knowledge without forgetting what was previously learned. The task poses numerous challenges, chiefly the larger storage size of multimodal data in exemplar-based methods and the computational cost of fine-tuning huge multimodal models. In this paper, we leverage a parameter-efficient tuning scheme to reduce the burden of fine-tuning and propose an exemplar masking framework to replay old knowledge efficiently. Specifically, unimportant tokens are masked based on the attention weights and the correlation across modalities, which significantly reduces the storage size of each exemplar and thus allows more exemplars to be stored in the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify the exemplars used to replay prior knowledge. In experiments, we not only evaluate our method on existing multimodal datasets but also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, where captions are generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and more robust to catastrophic forgetting under the same limited memory buffer. Code is available at https://github.com/YiLunLee/Exemplar_Masking_MCIL.
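The masking idea described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it ranks tokens only by a given class-token attention score and omits the cross-modal correlation step. The function names, the keep ratio, and the token counts (196 patches per image, a buffer of 20 full exemplars) are assumptions chosen for the example.

```python
# Simplified sketch of attention-based exemplar masking (hypothetical names).

def select_tokens(cls_attention, keep_ratio):
    """Return the sorted indices of the tokens with the highest attention."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(cls_attention) * keep_ratio))
    ranked = sorted(
        range(len(cls_attention)),
        key=lambda i: cls_attention[i],
        reverse=True,
    )
    # Restore positional order so kept tokens can be placed back later.
    return sorted(ranked[:n_keep])


def mask_exemplar(tokens, cls_attention, keep_ratio):
    """Keep only (position, token) pairs for the selected tokens."""
    return [(i, tokens[i]) for i in select_tokens(cls_attention, keep_ratio)]


def exemplars_in_budget(budget_tokens, tokens_per_exemplar):
    """Number of exemplars that fit in a buffer measured in tokens."""
    return budget_tokens // tokens_per_exemplar


# Toy exemplar with 8 tokens and made-up class-token attention weights.
tokens = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
attention = [0.05, 0.30, 0.10, 0.02, 0.25, 0.08, 0.15, 0.05]
stored = mask_exemplar(tokens, attention, keep_ratio=0.25)

# Assumed budget: room for 20 unmasked exemplars of 196 tokens each.
budget = 20 * 196
full_count = exemplars_in_budget(budget, 196)
masked_count = exemplars_in_budget(budget, 49)  # keeping a quarter of the tokens
```

Under these assumed numbers, keeping a quarter of the tokens lets the same buffer hold four times as many exemplars, which is the trade-off the abstract describes.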
