Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition

Yukun Zuo, Hantao Yao, Liansheng Zhuang, Changsheng Xu

TL;DR

This work tackles Class Incremental Audio-Visual Video Recognition (CIAVVR), where a model must learn new classes without forgetting old ones in a multimodal setting. It introduces Hierarchical Augmentation and Distillation (HAD), which has two parts: the Hierarchical Augmentation Module (HAM), which diversifies past data through segmental feature augmentation, and the Hierarchical Distillation Module (HDM), which preserves data knowledge via hierarchical logical and correlative distillation. The method exploits the hierarchical structure in both model and data, using video-distribution proxies and snippet-video correlations to retain intra- and inter-sample knowledge, and the authors provide a theoretical analysis supporting the augmentation strategy. Experiments on the AVE, AVK-100, AVK-200, and AVK-400 benchmarks show that HAD consistently improves Average Incremental Accuracy and Final Incremental Accuracy over strong baselines, with ablations confirming the value of the hierarchical design and multimodal fusion.
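The two reported metrics can be illustrated with a minimal sketch. It assumes the usual class-incremental definitions: Average Incremental Accuracy is the mean of the accuracies measured after each phase, and Final Incremental Accuracy is the accuracy after the last phase. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_incremental_accuracy(phase_accuracies):
    """Mean of the accuracies measured after each incremental phase (assumed definition)."""
    if not phase_accuracies:
        raise ValueError("need at least one phase accuracy")
    return mean(phase_accuracies)


def final_incremental_accuracy(phase_accuracies):
    """Accuracy measured after the last incremental phase (assumed definition)."""
    if not phase_accuracies:
        raise ValueError("need at least one phase accuracy")
    return phase_accuracies[-1]


# Hypothetical accuracies after each of three phases; not results from the paper.
example_phases = [0.9, 0.8, 0.7]
example_average = average_incremental_accuracy(example_phases)
example_final = final_incremental_accuracy(example_phases)
```

Under these definitions, a method that forgets less keeps later-phase accuracies high, which raises both numbers.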

Abstract

Audio-visual video recognition (AVVR) aims to integrate audio and visual cues to categorize videos accurately. While existing methods train AVVR models on provided datasets and achieve satisfactory results, they struggle to retain knowledge of historical classes when confronted with new classes in real-world situations. There are currently no dedicated methods for this problem, so this paper explores Class Incremental Audio-Visual Video Recognition (CIAVVR). In CIAVVR, both the stored data and the learned model of past classes contain historical knowledge, so the core challenge is how to capture past data knowledge and past model knowledge to prevent catastrophic forgetting. We introduce Hierarchical Augmentation and Distillation (HAD), which comprises the Hierarchical Augmentation Module (HAM) and the Hierarchical Distillation Module (HDM) to efficiently utilize the hierarchical structure of data and models, respectively. Specifically, HAM implements a novel augmentation strategy, segmental feature augmentation, to preserve hierarchical model knowledge. Meanwhile, HDM introduces newly designed hierarchical (video-distribution) logical distillation and hierarchical (snippet-video) correlative distillation to capture and maintain the hierarchical intra-sample knowledge of each sample and the hierarchical inter-sample knowledge between samples, respectively. Evaluations on four benchmarks (AVE, AVK-100, AVK-200, and AVK-400) demonstrate that the proposed HAD effectively captures hierarchical information in both data and models, resulting in better preservation of historical class knowledge and improved performance. Furthermore, we provide a theoretical analysis to support the necessity of the segmental feature augmentation strategy.
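The abstract does not specify how segmental feature augmentation is computed. As a rough sketch only, the code below assumes the augmentation perturbs stored features with zero-mean Gaussian noise, applied separately to a low-level and a high-level feature vector. The noise model, the function names, and the default scales are all assumptions made for illustration, not the authors' implementation.

```python
import random


def augment_feature(feature, noise_std, rng):
    """Return a copy of `feature` with zero-mean Gaussian noise added (assumed noise model)."""
    return [x + rng.gauss(0.0, noise_std) for x in feature]


def augment_two_levels(low_level, high_level, low_std=0.1, high_std=0.1, seed=0):
    """Perturb a low-level and a high-level feature vector independently.

    The two-level split follows the abstract's description; the noise scales
    and the seed are hypothetical parameters.
    """
    rng = random.Random(seed)
    augmented_low = augment_feature(low_level, low_std, rng)
    augmented_high = augment_feature(high_level, high_std, rng)
    return augmented_low, augmented_high


# Hypothetical stored features for one past-class sample.
low_aug, high_aug = augment_two_levels([0.2, -0.1, 0.5], [1.0, 0.3])
```

Using separate noise scales per level keeps the sketch open to tuning each level independently; whether the paper does so is not stated in this section.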

Paper Structure (33 sections, 28 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: (a) Most class incremental learning methods focus on image-level knowledge preservation. (b) We focus on class incremental audio-visual video recognition, which involves both visual and audio information.
  • Figure 2: The hierarchical structure in the model and the video data. For the model, low-level and high-level features embody different semantic information. The data comprises distribution-level, video-level, and snippet-level spatial information.
  • Figure 3: The proposed Hierarchical Augmentation and Distillation (HAD) framework consists of the Hierarchical Augmentation Module (HAM) and the Hierarchical Distillation Module (HDM). HAM uses segmental feature augmentation to perform low-level and high-level feature augmentation, enhancing data knowledge preservation. HDM, which consists of hierarchical logical distillation (HLD) and hierarchical correlative distillation (HCD), employs video-distribution logical distillation and snippet-video correlative distillation for model knowledge preservation.
  • Figure 4: Accuracy on AVE, AVK-100, AVK-200, and AVK-400 across incremental phases.
  • Figure 5: (a) Analysis of augmentation noise on AVE with 3 phases. (b) Analysis of multimodal input on AVE with 3 phases.
  • ...and 3 more figures