Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition
Yukun Zuo, Hantao Yao, Liansheng Zhuang, Changsheng Xu
TL;DR
This work tackles Class Incremental Audio-Visual Video Recognition (CIAVVR), where models must learn new classes without forgetting old ones in a multimodal setting. It introduces Hierarchical Augmentation and Distillation (HAD), comprising the Hierarchical Augmentation Module (HAM), which diversifies past data through segmental feature augmentation, and the Hierarchical Distillation Module (HDM), which preserves data knowledge via hierarchical logical and correlative distillation. The method exploits hierarchical structure in both the model and the data, using video-distribution proxies and snippet-video correlations to retain intra- and inter-sample knowledge, and provides theoretical support for the augmentation strategy. Empirical results on the AVE, AVK-100, AVK-200, and AVK-400 benchmarks show that HAD consistently improves Average Incremental Accuracy and Final Incremental Accuracy over strong baselines, with ablations confirming the value of the hierarchical design and multimodal fusion.
Abstract
Audio-visual video recognition (AVVR) aims to integrate audio and visual cues to categorize videos accurately. While existing methods train AVVR models on provided datasets and achieve satisfactory results, they struggle to retain knowledge of historical classes when confronted with new classes in real-world situations. Currently, there are no dedicated methods for addressing this problem, so this paper concentrates on exploring Class Incremental Audio-Visual Video Recognition (CIAVVR). For CIAVVR, since both the stored data and the learned model of past classes contain historical knowledge, the core challenge is how to capture past data knowledge and past model knowledge to prevent catastrophic forgetting. We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and the Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of the model and the data, respectively. Specifically, HAM implements a novel augmentation strategy, segmental feature augmentation, to preserve hierarchical model knowledge. Meanwhile, HDM introduces newly designed hierarchical (video-distribution) logical distillation and hierarchical (snippet-video) correlative distillation to capture and maintain the hierarchical intra-sample knowledge of each sample and the hierarchical inter-sample knowledge between samples, respectively. Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) demonstrate that the proposed HAD effectively captures hierarchical information in both data and models, resulting in better preservation of historical class knowledge and improved performance. Furthermore, we provide a theoretical analysis to support the necessity of the segmental feature augmentation strategy.
