MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

Huiyu Xiong, Lanxiao Wang, Heqian Qiu, Taijin Zhao, Benliu Qiu, Hongliang Li

TL;DR

This work tackles catastrophic forgetting in class-incremental multimodal video captioning by introducing MCF-VC, a framework that combines Fine-grained Sensitivity Selection, Glossary Ensemble, and Two-stage Knowledge Distillation. It formulates the incremental captioning problem, redesigns the backbone to accommodate sequential sub-tasks, and enforces knowledge retention through targeted distillation and selective parameter inheritance. Empirical results on MSR-VTT show substantial forgetting resistance without replay, while maintaining strong performance on new tasks; ablations confirm the effectiveness and synergy of the proposed modules. The approach advances continual learning for complex video-language tasks, enabling scalable, memory-efficient incremental captioning in dynamic real-world settings.

Abstract

Existing work on catastrophic forgetting, which arises because old categories are invisible in sequential input, has made some progress on relatively simple classification tasks. In contrast, video captioning is a more complex task in a multimodal scenario and has not yet been explored in incremental learning. After identifying this stability-plasticity problem when analyzing video with sequential input, we propose, for the first time, a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), which uses a mask over linear-layer parameters together with Fisher sensitivity to pick useful knowledge from old tasks. To better constrain the knowledge characteristics of old and new tasks at the specific feature level, we further design Two-stage Knowledge Distillation (TsKD), which learns the new task well while still weighing the old task. Specifically, we design two distillation losses that constrain, respectively, the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs, so that the inter-model and intra-model stylized knowledge of the old classes is retained while the new classes are learned. To illustrate the model's ability to resist forgetting, we design a metric, CIDER_t, to detect the stage forgetting rate. Our experiments on the public dataset MSR-VTT show that the proposed method significantly resists forgetting of previous tasks without replaying old samples, and performs well on the new task.
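The abstract names Fisher sensitivity as one ingredient of FgSS but does not give the procedure here. The following is only a minimal sketch of the general idea, assuming a diagonal empirical Fisher estimate (mean squared gradient per parameter) and a simple top-fraction mask. The function names, the `keep_fraction` parameter, and the rule of freezing masked parameters are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def fisher_sensitivity(per_sample_grads: List[List[float]]) -> List[float]:
    """Diagonal empirical Fisher: mean squared gradient of each parameter."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def top_fraction_mask(scores: List[float], keep_fraction: float) -> List[int]:
    """Return 1 for the most sensitive parameters and 0 for the rest.

    keep_fraction is a hypothetical hyperparameter, not taken from the paper.
    """
    k = max(1, int(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def masked_update(old: List[float], grads: List[float],
                  mask: List[int], lr: float) -> List[float]:
    """Assumed rule: masked (sensitive) parameters keep their old value,
    all other parameters take a plain gradient step on the new task."""
    return [w if m else w - lr * g for w, g, m in zip(old, grads, mask)]


# Toy example: two samples, three parameters.
grads_old_task = [[1.0, 0.0, 2.0], [1.0, 0.0, -2.0]]
scores = fisher_sensitivity(grads_old_task)
mask = top_fraction_mask(scores, keep_fraction=0.34)
new_params = masked_update([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], mask, lr=0.1)
```

In this toy case the third parameter has the largest squared gradients on the old task, so it is the one protected while the other two are free to adapt to the new task.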

Paper Structure

This paper contains 31 sections, 13 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of traditional video captioning and incremental video captioning. In traditional video captioning (a), all visual-language pairs are available during training. In incremental tasks (b), however, new classes arrive in sequence while none of the old data is visible. The model therefore needs to be updated to learn the new classes while maintaining accuracy on the old classes.
  • Figure 2: Graphical illustration of our approach, MCF-VC, for a new class-incremental task. The dynamic ${\mathcal{F}}^{3D}$ and static $\mathcal{F}^{2D}$ visual features of the new task, together with their corresponding text from data pre-processing, are fed into a modified video captioning backbone. To better suit the class-incremental setting, we design FgSS, which masks the fine-grained gradient information in three steps so that the new model achieves a balanced effect on the new and old tasks. Next, the cross-modal feature maps $(S_A, S'_A)$ extracted by semantic attention and the final output text features $(\mathbf{W}_{t+1}, \mathbf{W}_{t})$ undergo TsKD.
  • Figure 3: Diagram of the scalable glossary matrix across sequential tasks. When task $t+1$ arrives, the glossary $\mathcal{G}'$ grows larger than the old one, $\mathcal{G}$.
  • Figure 4: This schematic indicates how the FgSS module handles model parameters.
  • Figure 5: Line graphs showing the performance improvement of the incremental approach proposed in this paper over other approaches on natural language processing metrics.
  • ...and 2 more figures
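
The captions of Figures 2 and 3 describe two distillation targets, the semantic attention feature maps $(S_A, S'_A)$ and the output text features $(\mathbf{W}_{t+1}, \mathbf{W}_{t})$, and a glossary that grows with each task. The exact loss forms are not given here, so the sketch below assumes a mean squared error on the feature maps and a temperature-softened KL divergence on the outputs, restricted to the entries of the old glossary so that the larger new output can be compared with the old one. All function names, weights, and the temperature are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def feature_distill(old_map: List[float], new_map: List[float]) -> float:
    """Assumed stage 1: MSE between flattened attention feature maps."""
    return sum((a - b) ** 2 for a, b in zip(old_map, new_map)) / len(old_map)


def output_distill(old_logits: List[float], new_logits: List[float],
                   temperature: float = 2.0) -> float:
    """Assumed stage 2: KL(old || new) over the old glossary entries only.

    The new model scores a larger glossary, so its logits are truncated to
    the first len(old_logits) entries before the comparison.
    """
    n_old = len(old_logits)
    log_p = log_softmax(old_logits, temperature)
    log_q = log_softmax(new_logits[:n_old], temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def total_loss(caption_loss: float,
               old_map: List[float], new_map: List[float],
               old_logits: List[float], new_logits: List[float],
               w_feat: float = 1.0, w_out: float = 1.0) -> float:
    """New-task caption loss plus the two weighted distillation terms."""
    return (caption_loss
            + w_feat * feature_distill(old_map, new_map)
            + w_out * output_distill(old_logits, new_logits))


# Toy example: the old glossary has 2 words, the new one has 3.
loss = total_loss(2.0, [0.0, 0.0], [1.0, 3.0],
                  [1.0, 2.0], [1.0, 2.0, 5.0], w_feat=0.1)
```

Truncating to the old glossary is one simple way to handle the size mismatch; it presumes that new words are appended after the old ones, which the Figure 3 caption suggests but does not state outright.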