StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning
Huaijie Wang, De Cheng, Guozhang Li, Zhipeng Xu, Lingfeng He, Jie Li, Nannan Wang, Xinbo Gao
TL;DR
StPR tackles Video Class-Incremental Learning without exemplars by explicitly preserving spatiotemporal information. It combines Frame-Shared Semantics Distillation to selectively stabilize spatial channels with Temporal Decomposition-based Mixture-of-Experts to dynamically route based on temporal cues, enabling task-id-free inference. The approach leverages a frozen CLIP backbone with lightweight adapters and task-specific spatiotemporal experts, optimized via dual contrastive losses and a targeted FSSD regularization. Empirical results on UCF101, HMDB51, SSv2, and Kinetics400 show state-of-the-art performance with improved stability-plasticity balance and efficiency, highlighting practical impact for privacy-preserving, continual video understanding.
Abstract
Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. First, we introduce Frame-Shared Semantics Distillation (FSSD), which identifies semantically stable and meaningful channels by jointly considering semantic sensitivity and classification contribution. These important semantic channels are selectively regularized to maintain prior knowledge while still allowing adaptation to new classes. Second, we design a Temporal Decomposition-based Mixture-of-Experts (TD-MoE), which dynamically routes inputs to task-specific experts based on temporal dynamics, enabling inference without task IDs or stored exemplars. Together, these two components allow StPR to exploit both spatial semantics and temporal dynamics without storing exemplars. Extensive experiments on UCF101, HMDB51, and Kinetics400 show that our method outperforms existing baselines while offering improved interpretability and efficiency in VCIL. Code is available in the supplementary materials.
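The two ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the formulas, so every choice here is an assumption made for illustration. Specifically, channel importance is taken as the normalized product of two per-channel scores, the regularizer is an importance-weighted squared drift between old and new features, the "temporal decomposition" is a split of frame features into a frame-averaged part and a per-frame residual, and routing picks the expert key with the highest cosine similarity to a temporal cue. All function names are hypothetical.

```python
import numpy as np


def channel_weights(sensitivity, contribution):
    """Assumed importance rule: product of two per-channel scores, normalized to sum to 1."""
    score = np.abs(np.asarray(sensitivity, dtype=float)) * np.abs(
        np.asarray(contribution, dtype=float)
    )
    total = score.sum()
    if total == 0.0:
        # No channel stands out, so fall back to uniform weights.
        return np.full(score.shape, 1.0 / score.size)
    return score / total


def weighted_channel_loss(old_feat, new_feat, weights):
    """Importance-weighted squared drift between old and new features of shape (batch, channels)."""
    sq_drift = (np.asarray(new_feat, dtype=float) - np.asarray(old_feat, dtype=float)) ** 2
    return float((sq_drift * weights).sum(axis=-1).mean())


def split_frames(frame_feats):
    """Assumed decomposition of (frames, channels) features into a shared mean and a residual."""
    frame_feats = np.asarray(frame_feats, dtype=float)
    shared = frame_feats.mean(axis=0)
    residual = frame_feats - shared
    return shared, residual


def route(residual, expert_keys):
    """Pick the expert whose key is most cosine-similar to a temporal cue.

    The cue is assumed to be the per-channel standard deviation of the residual over frames.
    """
    cue = np.asarray(residual, dtype=float).std(axis=0)
    cue = cue / (np.linalg.norm(cue) + 1e-8)
    keys = np.asarray(expert_keys, dtype=float)
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(keys @ cue))


# Toy usage: channel 0 is important, channel 1 is not.
weights = channel_weights([1.0, 0.0], [2.0, 5.0])
old = np.zeros((1, 2))
loss_important = weighted_channel_loss(old, np.array([[1.0, 0.0]]), weights)
loss_unimportant = weighted_channel_loss(old, np.array([[0.0, 1.0]]), weights)

# Two frames that differ only in channel 0, so the temporal cue points along channel 0.
shared, residual = split_frames([[0.0, 1.0], [2.0, 1.0]])
chosen_expert = route(residual, expert_keys=[[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup, drift on the important channel is penalized while the same drift on the unimportant channel is free, which is the stability-plasticity trade-off that selective regularization aims for. Routing needs only the input's own temporal statistics, so no task ID is required at inference.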
