Exemplar Masking for Multimodal Incremental Learning

Yi-Lun Lee, Chen-Yu Lee, Wei-Chen Chiu, Yi-Hsuan Tsai

TL;DR

This work tackles multimodal class-incremental learning (MCIL) under tight memory and computation by introducing an exemplar masking framework that preserves only discriminative tokens from both image and text modalities using class-token attention, while retaining contextual information through cross-modal cues. A parameter-efficient tuning approach (SSF) and a multimodal data augmentation strategy further enhance replay quality, enabling more old-class exemplars within the same memory budget. The authors extend ImageNet-R into a multimodal MM-ImageNet-R by generating captions with InstructBLIP, then demonstrate improved accuracy and robustness against forgetting on MM-ImageNet-R and UPMC Food-101 across multiple incremental phases. The findings indicate substantial memory savings (fewer stored tokens per exemplar) and stronger replay performance, offering a scalable solution for deploying large multimodal models in continual learning settings.

Abstract

Multimodal incremental learning must integrate information from multiple modalities while learning new knowledge without forgetting what was previously learned. The task poses two main challenges: multimodal data inflate the storage cost of exemplar-based methods, and fine-tuning large multimodal models is computationally expensive. In this paper, we adopt a parameter-efficient tuning scheme to reduce the fine-tuning burden and propose an exemplar masking framework to replay old knowledge efficiently. Specifically, non-important tokens are masked based on attention weights and cross-modal correlation, which significantly reduces the storage size of each exemplar and allows more exemplars to be stored in the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify exemplars for replaying prior knowledge. In experiments, we evaluate our method on existing multimodal datasets and also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, with captions generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and more robust to catastrophic forgetting under the same limited memory buffer. Code is available at https://github.com/YiLunLee/Exemplar_Masking_MCIL.
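The storage argument in the abstract can be made concrete with a small back-of-envelope sketch. It counts storage in tokens, which is a simplification, and every number in it (196 image tokens, 40 text tokens, a buffer sized for 100 unmasked samples, a 25% keep ratio) is a hypothetical value chosen for illustration, not a figure from the paper. The function name `exemplars_that_fit` is likewise an assumption.

```python
import math


def exemplars_that_fit(budget_units: int, units_per_exemplar: int, keep_ratio: float = 1.0) -> int:
    """Count exemplars that fit in a buffer when only keep_ratio of each one is stored.

    All quantities use an abstract storage unit (here: tokens). This is an
    illustrative simplification, not the paper's own storage accounting.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Storage actually used by one masked exemplar, rounded up to whole units.
    stored = max(1, math.ceil(units_per_exemplar * keep_ratio))
    return budget_units // stored


# Hypothetical numbers: 196 image tokens + 40 text tokens per sample,
# and a buffer sized to hold 100 unmasked samples.
tokens_per_sample = 196 + 40
budget = 100 * tokens_per_sample

full = exemplars_that_fit(budget, tokens_per_sample)          # no masking
masked = exemplars_that_fit(budget, tokens_per_sample, 0.25)  # keep 25% of tokens
print(full, masked)
```

Under these assumed numbers, keeping a quarter of the tokens lets the same buffer hold four times as many exemplars, which is the mechanism the abstract relies on.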

Paper Structure

This paper contains 18 sections, 4 equations, 15 figures, 6 tables, 1 algorithm.

Figures (15)

  • Figure 1: Illustration of exemplar replay for a new class "ice bear". In the conventional exemplar replay framework, only a few samples can be stored in the limited memory buffer because each one requires a large amount of storage. In contrast, our exemplar masking framework preserves the important regions of the image and discards the non-important ones to reduce the storage space. Moreover, we propose to preserve the information of the discarded regions via another modality (i.e., text) to retain as much information as possible. Under the same memory buffer, our framework can store more samples, contributing to more effective knowledge replay.
  • Figure 2: Overview of the proposed exemplar masking framework for multimodal class-incremental learning, including exemplar masking, exemplar selection, and multimodal data augmentation. In the $l$-th incremental phase, we first apply multimodal data augmentation on the $l$-th memory buffer and train the model with both new data and augmented exemplars. After training, we generate the masked exemplars from new data via the proposed exemplar masking and exemplar selection methods, then combine them with the memory buffer.
  • Figure 3: Overview of the proposed exemplar masking and exemplar selection methods. Given a training sample $(x_T, x_I)$ of class $c$, we first compute the class-token attention over the image modality, $A_{CLS \to I}$, and obtain the image mask $M_I$ by thresholding it at $\tau_I$. The image regions selected by $M_I$ are preserved in the masked image $\tilde{x}_I$, while the others are discarded. To preserve contextual information from the discarded image regions, we compute the cross-attention $A_{I \to T}$ between the discarded image tokens and the text tokens and obtain the text mask $M_T$ by thresholding it at $\tau_T$. The masked text $\tilde{x}_T$ is then produced by applying $M_T$ to $x_T$. Finally, we compute the cosine similarity between the feature $f_c(\tilde{x})$ of the masked sample $\tilde{x} = (\tilde{x}_T, \tilde{x}_I)$ and the class mean $\mu_c$; the top-$k$ samples by similarity are selected as the exemplars of class $c$ and stored without exceeding the memory size.
  • Figure 4: Examples of the masked exemplars and the corresponding attention maps from different classes, including (a) ice bear, (b) poodle, and (c) grand piano. Masked text is shown in red and discarded text in gray; important words carrying contextual or class-related information are highlighted.
  • Figure 5: Experimental results under the different constraints of the memory buffer size. Our proposed method preserves more exemplar samples under the same limited storage space and improves the baseline by a large margin.
  • ...and 10 more figures
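
The masking and selection steps described in the Figure 3 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: attention maps are passed in as ready-made lists, the function names are hypothetical, and the rule for aggregating $A_{I \to T}$ (keep a text token if any discarded image token attends to it with weight at least $\tau_T$) is an assumption, since the caption does not specify the aggregation.

```python
import math


def threshold_mask(scores, tau):
    """Keep entries whose score is at least tau (True = preserved)."""
    return [s >= tau for s in scores]


def text_mask_from_discarded(cross_attn, image_mask, tau_t):
    """Build the text mask from attention of discarded image tokens.

    cross_attn[i][j] is the attention from image token i to text token j.
    Assumption: a text token is kept if ANY discarded image token attends
    to it with weight >= tau_t.
    """
    n_text = len(cross_attn[0]) if cross_attn else 0
    keep = [False] * n_text
    for i, preserved in enumerate(image_mask):
        if preserved:
            continue  # only discarded image regions contribute context
        for j, weight in enumerate(cross_attn[i]):
            if weight >= tau_t:
                keep[j] = True
    return keep


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_exemplars(features, class_mean, k):
    """Indices of the k masked-sample features closest to the class mean."""
    ranked = sorted(
        range(len(features)),
        key=lambda i: cosine(features[i], class_mean),
        reverse=True,
    )
    return ranked[:k]


# Toy inputs (hypothetical values): 4 image tokens, 3 text tokens.
attn_cls_to_img = [0.40, 0.05, 0.35, 0.20]
cross_attn = [
    [0.9, 0.05, 0.05],
    [0.1, 0.7, 0.2],
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.6],
]
image_mask = threshold_mask(attn_cls_to_img, tau=0.25)
text_mask = text_mask_from_discarded(cross_attn, image_mask, tau_t=0.5)

# Toy features of three masked samples and an assumed class mean.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
chosen = select_exemplars(features, class_mean=[1.0, 0.9], k=2)
print(image_mask, text_mask, chosen)
```

In a real pipeline the attention maps and features would come from the multimodal transformer, and the thresholds $\tau_I$ and $\tau_T$ would control how much of each exemplar is stored.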