CSTA: Spatial-Temporal Causal Adaptive Learning for Exemplar-Free Video Class-Incremental Learning

Tieyuan Chen, Huabin Liu, Chern Hong Lim, John See, Xing Gao, Junhui Hou, Weiyao Lin

TL;DR

This paper tackles video class-incremental learning without exemplars, a setting that must preserve both spatial appearances and temporal action dynamics. It introduces CSTA, a lightweight adapter-based framework with separate spatial and temporal adapters, augmented by two causal mechanisms: causal recovery via relation distillation, and causal compensation to mitigate conflicts between increment and memorization. Built on TimeSformer, CSTA trains only the new adapters and a task-specific classifier, while employing cross-task attention and logit distillation to preserve past knowledge. Empirical results on the vCLIMB/TCD benchmarks show state-of-the-art accuracy, surpassing exemplar-based methods by 4.2% on average, and a storage reduction of up to 61.9%, supporting the approach's efficiency for practical video continual learning.
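The logit distillation mentioned above can be illustrated with a minimal sketch. The summary does not give the paper's exact loss, so everything here is an assumption for illustration: the function names, the temperature parameter, and the choice to compare only the logits of previously seen classes.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def logit_distillation_loss(old_logits, new_logits, temperature=2.0):
    """Hypothetical distillation term: the frozen old model is the teacher.

    Assumption: only the first len(old_logits) entries of new_logits
    correspond to old classes, so the comparison is restricted to them.
    """
    n_old = len(old_logits)
    teacher = softmax(old_logits, temperature)
    student = softmax(new_logits[:n_old], temperature)
    return kl_divergence(teacher, student)
```

The loss is zero when the new model reproduces the old model's distribution over the old classes and grows as the two distributions drift apart, which is the sense in which such a term preserves past knowledge.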

Abstract

Continual learning aims to acquire new knowledge while retaining past information. Class-incremental learning (CIL) presents a challenging scenario where classes are introduced sequentially. For video data, the task becomes more complex than image data because it requires learning and preserving both spatial appearance and temporal action involvement. To address this challenge, we propose a novel exemplar-free framework that equips separate spatiotemporal adapters to learn new class patterns, accommodating the incremental information representation requirements unique to each class. While separate adapters are proven to mitigate forgetting and fit unique requirements, naively applying them hinders the intrinsic connection between spatial and temporal information increments, affecting the efficiency of representing newly learned class information. Motivated by this, we introduce two key innovations from a causal perspective. First, a causal distillation module is devised to maintain the relation between spatial-temporal knowledge for a more efficient representation. Second, a causal compensation mechanism is proposed to reduce the conflicts during increment and memorization between different types of information. Extensive experiments conducted on benchmark datasets demonstrate that our framework can achieve new state-of-the-art results, surpassing current exemplar-based methods by 4.2% in accuracy on average.

Paper Structure (21 sections, 12 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: Framework and Relation Experiment. In sub-figure (a), our method utilizes lightweight adapters to accomplish an exemplar-free framework. In sub-figure (b), to reach a more efficient representation during adaptation, we analyze the relation between spatial/temporal increment and memorization by comparing the model with and without an adapter. Here S-MSA and T-MSA denote spatial and temporal multi-head self-attention, S-Ada and T-Ada denote the adaptation modules, and alignment means aligning the classification results with and without adaptation via a KL loss. Through analysis of the optimization directions, as shown in sub-figure (c), we observe that spatial and temporal increments become increasingly irrelevant, and conflicts emerge between the increment of one type of knowledge and the memorization of the other.
  • Figure 2: CSTA Structure. The left is TimeSformer architecture, while the middle shows the overall structure of our whole spatial-temporal causal adaptation block. The detailed structures of the adaptation module and adapter module are shown on the right, with only the modules marked with a tuned flag being learnable. The causal recovery loss ensures the effective representation of new knowledge through relation recovery between spatial and temporal knowledge, and the conflict compensation effect is incorporated into the classification results from the main branch to enhance memorization. The causal recovery loss and conflict compensation mechanism are employed only during the training phase.
  • Figure 3: Visualization. Example class (a) is introduced in the new task: (a1) shows the attention map learned with adapters, while (a2) shows the attention map extracted by the model trained on old tasks, illustrating the learning ability of the adapter. In contrast, example class (b) is introduced in a previous task: (b1) shows the attention map learned with adapters, while (b2) shows the attention map extracted by the model with all parameters trainable, illustrating the memorization ability of the adapter.
  • Figure 4: Motivations. (a) shows the relation between spatial/temporal increment and memorization during training. The blue line shows the cosine value between the spatial increment and the temporal increment approaching zero, which indicates that they become irrelevant. The cosine values of the green and yellow lines fall below zero, which indicates the existence of conflicts, and the red line indicates that the mutual benefit between memorizations can be better used for memory enhancement. (b1) shows the action "Pulling something from left to right" and (b2) shows the action "Pulling something from right to left", which exhibit significantly different temporal information. In contrast, (c1) shows the action "Beach soccer" and (c2) shows the action "Futsal", which exhibit significantly different spatial information. These different representation needs drive our adaptation design to memorize information separately in video CIL.
  • Figure 5: Causal Methods. Sub-figure (a) shows the detailed implementation of relation recovery for the effective representation of new knowledge, while sub-figure (b) shows the causal compensation implementation for relieving conflicts.
  • ...and 15 more figures
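
The captions of Figures 1 and 4 read the relation between increment and memorization off the cosine value between optimization directions: near zero means irrelevant, below zero means conflicting. A minimal sketch of that reading follows. The function names and the tolerance `tol` are illustrative assumptions, and the captions do not say how the directions themselves are extracted.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two update directions (flat lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_u * norm_v)

def describe_relation(u, v, tol=0.1):
    """Label a pair of directions; tol is an arbitrary illustrative threshold."""
    c = cosine(u, v)
    if c < -tol:
        return "conflicting"
    if c > tol:
        return "aligned"
    return "irrelevant"

# Hypothetical directions, e.g. flattened gradients of two objectives.
spatial_increment = [1.0, 0.0]
temporal_increment = [0.0, 1.0]
temporal_memorization = [-1.0, 0.0]

print(describe_relation(spatial_increment, temporal_increment))
print(describe_relation(spatial_increment, temporal_memorization))
```

With these toy vectors the first pair is orthogonal and the second pair points in opposite directions, mirroring the two situations the captions describe.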