CSTA: Spatial-Temporal Causal Adaptive Learning for Exemplar-Free Video Class-Incremental Learning
Tieyuan Chen, Huabin Liu, Chern Hong Lim, John See, Xing Gao, Junhui Hou, Weiyao Lin
TL;DR
This paper tackles exemplar-free video class-incremental learning, a setting in which a model must preserve both spatial appearance and temporal action dynamics without storing past samples. It introduces CSTA, a lightweight adapter-based framework with separate spatial and temporal adapters, augmented by two causal mechanisms: causal recovery via relation distillation, and causal compensation to mitigate conflicts between increment and memorization. Built on TimeSformer, CSTA trains only the new adapters and a task-specific classifier, while employing cross-task attention and logit distillation to preserve past knowledge. On the vCLIMB/TCD benchmarks, CSTA reports state-of-the-art results, with an average accuracy gain of 4.2% over exemplar-based methods and a storage reduction of up to 61.9%, supporting the approach’s efficiency for practical video continual learning.
Abstract
Continual learning aims to acquire new knowledge while retaining past information. Class-incremental learning (CIL) is a challenging scenario in which classes are introduced sequentially. For video data, the task is more complex than for image data because it requires learning and preserving both spatial appearance and temporal action involvement. To address this challenge, we propose a novel exemplar-free framework that employs separate spatial and temporal adapters to learn new class patterns, accommodating the incremental information representation requirements unique to each class. While separate adapters have been shown to mitigate forgetting and fit these unique requirements, naively applying them disrupts the intrinsic connection between spatial and temporal information increments, reducing the efficiency with which newly learned class information is represented. Motivated by this, we introduce two key innovations from a causal perspective. First, a causal distillation module is devised to maintain the relation between spatial and temporal knowledge for a more efficient representation. Second, a causal compensation mechanism is proposed to reduce the conflicts between different types of information during increment and memorization. Extensive experiments on benchmark datasets demonstrate that our framework achieves new state-of-the-art results, surpassing current exemplar-based methods by 4.2% in accuracy on average.
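The training scheme summarized in the TL;DR (a frozen backbone, with only the newest spatial and temporal adapters and a task-specific classifier being trained) can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the class names, the two-layer bottleneck adapter design, the single adapter pair per task, and all sizes below are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool = True


def adapter_params(dim: int, bottleneck: int) -> int:
    # Assumed adapter design: down-projection and up-projection, each with a bias.
    return (dim * bottleneck + bottleneck) + (bottleneck * dim + dim)


@dataclass
class IncrementalModel:
    dim: int              # token embedding width (hypothetical value)
    bottleneck: int       # adapter hidden width (hypothetical value)
    backbone_params: int  # size of the frozen backbone (hypothetical value)
    modules: list = field(default_factory=list)

    def __post_init__(self):
        # The pretrained backbone is never updated.
        self.modules.append(Module("backbone", self.backbone_params, trainable=False))

    def add_task(self, task_id: int, num_new_classes: int) -> None:
        # Freeze everything learned so far, then add fresh trainable parts.
        for m in self.modules:
            m.trainable = False
        a = adapter_params(self.dim, self.bottleneck)
        self.modules.append(Module(f"spatial_adapter_{task_id}", a))
        self.modules.append(Module(f"temporal_adapter_{task_id}", a))
        head = self.dim * num_new_classes + num_new_classes  # linear classifier
        self.modules.append(Module(f"classifier_{task_id}", head))

    def trainable_params(self) -> int:
        return sum(m.num_params for m in self.modules if m.trainable)

    def total_params(self) -> int:
        return sum(m.num_params for m in self.modules)


model = IncrementalModel(dim=768, bottleneck=64, backbone_params=120_000_000)
for task_id in range(2):
    model.add_task(task_id, num_new_classes=10)
    print(task_id, model.trainable_params(), model.total_params())
```

Under these assumptions, the number of trainable parameters stays constant from task to task while the total grows only by the size of the new adapters and classifier, which is the sense in which an adapter-based design can be lightweight without storing exemplars.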
