Double Mixture: Towards Continual Event Detection from Speech

Jingqi Kang, Tongtong Wu, Jinming Zhao, Guitao Wang, Yinwei Wei, Hao Yang, Guilin Qi, Yuan-Fang Li, Gholamreza Haffari

TL;DR

This paper introduces a new task, continual event detection from speech, and proposes a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.

Abstract

Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events. Traditional ASR systems often overlook the interplay between these events, focusing solely on content, even though the interpretation of dialogue can vary with environmental context. This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events. We introduce a new task, continual event detection from speech, for which we also provide two benchmark datasets. To address the challenges of catastrophic forgetting and effective disentanglement, we propose a novel method, 'Double Mixture.' This method merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting. Our comprehensive experiments show that this task presents significant challenges that are not effectively addressed by current state-of-the-art methods in either computer vision or natural language processing. Our approach achieves the lowest rates of forgetting and the highest levels of generalization, proving robust across various continual learning sequences. Our code and data are available at https://anonymous.4open.science/status/Continual-SpeechED-6461.

Paper Structure (23 sections, 9 equations, 3 figures, 9 tables)

Figures (3)

  • Figure 1: In continual learning, learners incrementally acquire new event types and must evaluate all previously learned types during testing. This process is particularly challenging in speech-based scenarios due to the complex interplay of semantic content (semantic event) and background sounds (acoustic event).
  • Figure 2: Framework of the proposed Double Mixture method.
  • Figure 3: Ablation study on three datasets. The horizontal axis shows the datasets, and the vertical axis shows the average accuracy over the entire task sequence.
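
The captions above refer to evaluating all previously learned event types and to average accuracy over a task sequence. As a rough illustration, the sketch below computes two metrics commonly used in continual learning, average accuracy and average forgetting, from a table of per-task accuracies. The function names, the metric definitions, and the numbers are assumptions chosen for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import List


def average_accuracy(acc: List[List[float]], step: int) -> float:
    """Mean accuracy over all tasks seen up to and including `step`.

    acc[s][t] is the accuracy on task t measured after training on task s
    (hypothetical layout: row s holds s + 1 entries).
    """
    seen = acc[step][: step + 1]
    return sum(seen) / len(seen)


def average_forgetting(acc: List[List[float]], step: int) -> float:
    """Mean drop from each earlier task's best past accuracy to its current one."""
    if step == 0:
        return 0.0  # nothing can be forgotten after the first task
    drops = []
    for task in range(step):
        best_before = max(acc[s][task] for s in range(task, step))
        drops.append(best_before - acc[step][task])
    return sum(drops) / len(drops)


# Illustrative numbers only, not results from the paper.
acc = [
    [0.90],
    [0.80, 0.85],
    [0.70, 0.75, 0.95],
]

final_step = len(acc) - 1
print(f"average accuracy:   {average_accuracy(acc, final_step):.2f}")
print(f"average forgetting: {average_forgetting(acc, final_step):.2f}")
```

Under these definitions, a method that prevents forgetting keeps the second number close to zero while the first stays high as new event types are added.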