Mirage Fools the Ear, Mute Hides the Truth: Precise Targeted Adversarial Attacks on Polyphonic Sound Event Detection Systems

Junjie Su, Weifei Jin, Yuxin Cao, Derui Wang, Kai Ye, Jie Hao

TL;DR

The paper tackles the vulnerability of polyphonic SED systems to targeted adversarial attacks by introducing the Mirage and Mute Attack (M2A), a framework that achieves precise edits through a preservation constraint on non-target regions. It adds a novel Editing Precision (EP) metric to jointly evaluate attack success and precision. Through extensive experiments on DESED and TUT-SED with CRNN and ATST-SED, M2A demonstrates high EP and improved precision over baselines, highlighting security risks in safety-critical audio surveillance. The work further analyzes defenses and discusses limitations, emphasizing the need for robust, context-aware SED designs and ensemble strategies to mitigate adversarial threats in real-world deployments.

Abstract

Sound Event Detection (SED) systems are increasingly deployed in safety-critical applications such as industrial monitoring and audio surveillance. However, their robustness against adversarial attacks has not been well explored. Existing audio adversarial attacks targeting SED systems, which incorporate both detection and localization capabilities, often lack effectiveness due to SED's strong contextual dependencies, or lack precision because they focus solely on misclassifying the target region as the target event and thereby inadvertently affect non-target regions. To address these challenges, we propose the Mirage and Mute Attack (M2A) framework, which is designed for targeted adversarial attacks on polyphonic SED systems. In our optimization process, we impose specific constraints on the non-target output, which we refer to as preservation loss, ensuring that our attack does not alter the model outputs for non-target regions, thus achieving precise attacks. Furthermore, we introduce a novel evaluation metric, Editing Precision (EP), that balances effectiveness and precision, enabling our method to simultaneously enhance both. Comprehensive experiments show that M2A achieves 94.56% and 99.11% EP on two state-of-the-art SED models, demonstrating that the framework is sufficiently effective while significantly enhancing attack precision.
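To make the idea of a preservation constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model emits a frame-by-class matrix of event probabilities, that the attack term is a binary cross-entropy over the target region, and that the preservation term is a mean squared difference from the clean outputs over the non-target region. The function names `bce` and `m2a_style_loss`, the weight `alpha`, and the choice of these two loss forms are all illustrative assumptions; the abstract does not specify them.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def m2a_style_loss(adv_probs, clean_probs, target_labels, target_mask, alpha=1.0):
    """Hypothetical two-term objective (assumed form, not taken from the paper).

    adv_probs, clean_probs: frames x classes probabilities on the adversarial
        and the clean audio.
    target_labels: desired labels; only cells inside the target region are used.
    target_mask: True where a (frame, class) cell belongs to the target region.
    alpha: assumed weight on the preservation term.
    Returns (total, attack_term, preservation_term).
    """
    attack_terms, preserve_terms = [], []
    for t, row in enumerate(adv_probs):
        for c, p in enumerate(row):
            if target_mask[t][c]:
                # Target region: push the output toward the attacker's label.
                attack_terms.append(bce(p, target_labels[t][c]))
            else:
                # Non-target region: penalize any drift from the clean output.
                preserve_terms.append((p - clean_probs[t][c]) ** 2)
    attack = sum(attack_terms) / max(len(attack_terms), 1)
    preserve = sum(preserve_terms) / max(len(preserve_terms), 1)
    return attack + alpha * preserve, attack, preserve


# Toy example: 3 frames, 2 classes (column 0 = "door open", column 1 = "dog barking").
clean = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.1]]
mask = [[False, False], [True, True], [False, False]]  # edit frame 1 only
labels = [[0, 0], [0, 1], [0, 0]]  # delete "door open", insert "dog barking"

# Both candidates succeed in frame 1; only the second disturbs frames 0 and 2.
precise = [[0.9, 0.1], [0.05, 0.95], [0.1, 0.1]]
imprecise = [[0.2, 0.1], [0.05, 0.95], [0.1, 0.8]]

precise_total, precise_attack, precise_preserve = m2a_style_loss(
    precise, clean, labels, mask)
imprecise_total, imprecise_attack, imprecise_preserve = m2a_style_loss(
    imprecise, clean, labels, mask)
```

In this toy setup the two candidate outputs have the same attack term, so the preservation term alone separates the precise edit from the imprecise one. That is the behavior the abstract attributes to the preservation loss.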

Paper Structure

This paper contains 21 sections, 15 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: All attacks aim to delete 'door open' and insert 'dog barking'. Ineffective attacks fail to fully achieve their goal, while imprecise ones disrupt the non-target region. Our method ensures effectiveness while better preserving the non-target region.
  • Figure 2: Polyphonic SED output: The audio is divided into multiple frames and each frame predicts the presence of different sound events.
  • Figure 3: Visualization of the target region and the non-target region under the $\text{M}^2\text{A}$ framework.
  • Figure 4: Effect of different $\alpha$ values on attack performance. There exists a trade-off between ASR and UER. We selected the $\alpha$ value that maximizes EP while ensuring sufficient ASR.
  • Figure 5: Effect of different $\tau$ values on attack performance. The trends of EP and ASR exhibit strong similarity, allowing $\tau$ selection to maximize EP with adequate ASR.
  • ...and 2 more figures