Slightly Shift New Classes to Remember Old Classes for Video Class-Incremental Learning

Jian Jiao, Yu Dai, Hefei Mei, Heqian Qiu, Chuanyang Gong, Shiyuan Tang, Xinpeng Hao, Hongliang Li

TL;DR

SNRO tackles catastrophic forgetting in video class-incremental learning under a fixed memory budget by slightly shifting learning toward low-semantic features through Examples Sparse (ES) and by preventing overfitting to new classes through Early Break (EB). ES down-samples old-class videos so that the memory set holds more examples, and uses Frame Alignment to keep the stored clips compatible with the network input, while EB stops training early to avoid over-optimizing for new classes. On UCF101, HMDB51, and UESTC-MMEA-CL with the same memory budget, SNRO delivers higher final-task accuracy and lower forgetting than prior memory-replay approaches. Overall, SNRO offers a practical way to balance old-class and new-class knowledge under limited storage in video class-incremental learning.

Abstract

Recent video class-incremental learning methods tend to over-pursue accuracy on newly seen classes and rely on memory sets to mitigate catastrophic forgetting of old classes. However, limited storage allows only a few representative videos to be stored. We therefore propose SNRO, which slightly shifts the features of new classes to remember old classes. Specifically, SNRO consists of Examples Sparse (ES) and Early Break (EB). ES samples frames at a lower rate to build memory sets and later uses interpolation to align those sparse frames. In this way, SNRO stores more examples under the same memory consumption and forces the model to focus on low-semantic features, which are harder to forget. EB terminates training at an early epoch, preventing the model from overstretching into the high-semantic space of the current task. Experiments on the UCF101, HMDB51, and UESTC-MMEA-CL datasets show that SNRO outperforms other approaches under the same memory consumption.
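
The ES step described above (store fewer frames per video, then interpolate back to the network's input length) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `sparse_sample` and `align_frames`, the evenly spaced sampling, and the choice of linear interpolation along the time axis are all assumptions, since the abstract only says that "interpolation" is used.

```python
import numpy as np


def sparse_sample(video: np.ndarray, num_frames: int) -> np.ndarray:
    """Keep `num_frames` evenly spaced frames from a (T, H, W, C) video.

    Hypothetical helper: even spacing is an assumption, not taken from the paper.
    """
    t = video.shape[0]
    idx = np.linspace(0, t - 1, num=num_frames).round().astype(int)
    return video[idx]


def align_frames(sparse: np.ndarray, target_frames: int) -> np.ndarray:
    """Stretch a sparse clip back to `target_frames` frames.

    Hypothetical helper: linear interpolation between neighbouring stored
    frames is assumed here; the paper may use a different scheme.
    """
    n = sparse.shape[0]
    if n == 1:
        # Nothing to interpolate between, so repeat the single frame.
        return np.repeat(sparse, target_frames, axis=0).astype(np.float64)
    pos = np.linspace(0, n - 1, num=target_frames)  # fractional source index
    lo = np.floor(pos).astype(int)                  # left neighbour
    hi = np.minimum(lo + 1, n - 1)                  # right neighbour, clamped
    w = (pos - lo).reshape(-1, 1, 1, 1)             # blend weight per frame
    return (1.0 - w) * sparse[lo] + w * sparse[hi]


# Toy video: 16 frames of size 2x2x3 whose pixel value equals the frame index.
video = np.arange(16, dtype=np.float64).reshape(16, 1, 1, 1) * np.ones((1, 2, 2, 3))
stored = sparse_sample(video, num_frames=4)       # what goes into the memory set
restored = align_frames(stored, target_frames=8)  # what is fed to the network
```

In this sketch, storing 4 frames instead of 8 per video halves the per-video cost, so twice as many videos fit in the same memory set.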

Paper Structure (11 sections, 5 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Analysis of memory consumption. $xF\times yV=zMb$ means that, for each class, $x$ frames are sampled from each of $y$ different videos and stored. Assuming a frame resolution of $3\times 224\times 224$, the total memory consumption is $z$ MB.
  • Figure 2: Illustration of the proposed SNRO framework. Note that Examples Sparse is also used in the testing phase.
  • Figure 3: $(a)$ and $(b)$ visualize GradCAM maps across tasks for a video labeled "Biking" under TCD and SNRO. This class appears in task $T_6$. The first row of $(a)$ and $(b)$ shows the raw frames. The second and third rows show the corresponding GradCAM maps at the end of task $T_6$ and task $T_8$. At the end of $T_6$, SNRO converges less strongly than TCD on the bicycle's features, which corresponds to "Shift New Classes". At the end of $T_8$, it is better than TCD, which corresponds to "Remember Old Classes".
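
The memory accounting in the Figure 1 caption can be worked through numerically. The sketch below is illustrative only: the helper `memory_mb`, the assumption of one byte per pixel value (uint8 storage), the convention 1 MB = $2^{20}$ bytes, and the example settings (8 frames × 5 videos versus 4 frames × 10 videos) are all assumptions, not figures from the paper.

```python
def memory_mb(frames_per_video: int, videos_per_class: int,
              channels: int = 3, height: int = 224, width: int = 224,
              bytes_per_value: int = 1) -> float:
    """Per-class memory for an xF x yV setting, in MB.

    Assumes uint8 storage (1 byte per value) and 1 MB = 2**20 bytes;
    both are assumptions made for this illustration.
    """
    total_bytes = (frames_per_video * videos_per_class
                   * channels * height * width * bytes_per_value)
    return total_bytes / 2**20


# Hypothetical settings: halving the frames per video lets twice as many
# videos fit into the same per-class budget.
dense = memory_mb(frames_per_video=8, videos_per_class=5)
sparse = memory_mb(frames_per_video=4, videos_per_class=10)
print(f"8F x 5V  = {dense:.2f} MB")
print(f"4F x 10V = {sparse:.2f} MB")
```

Both settings cost the same amount of memory, which is the trade-off ES exploits: fewer frames per stored video in exchange for more stored videos per class.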