Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov; Lea Schönherr; Timo Gerkmann

TL;DR

This work addresses the security of modern speech enhancement against adversarial manipulation by framing SE as either predictive or generative (diffusion-based) and demonstrating targeted white-box attacks that push the enhanced output toward a chosen target. The authors develop an attack objective with psychoacoustic masking and an $\ell_2$ budget, evaluating Direct Mapping, Complex Ratio Mask, and diffusion-based SE (SGMSE+) on the EARS-WHAM-v2 dataset. Their key findings show predictive SE is readily steered to attacker targets under modest budgets, whereas diffusion-based SE remains more robust, particularly when stochastic reverse sampling is used; deterministic or fewer reverse steps can increase attack success. The results highlight practical security implications for expressive SE systems and suggest diffusion-based approaches with maintained stochasticity as a more resilient direction for real-world deployments.

Abstract

Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
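
To make the idea of a targeted white-box attack under an $\ell_2$ budget concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear "enhancer" `W` (a hypothetical stand-in for a real SE network), a squared-error loss toward the attacker's target, and plain projected gradient descent. The psychoacoustic masking term described above is omitted, and all names (`project_l2`, `targeted_attack`, `eps`) are illustrative.

```python
import numpy as np

def project_l2(delta, eps):
    """Project delta onto the l2 ball of radius eps (the attack budget)."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def targeted_attack(W, x, target, eps, steps=200):
    """Projected gradient descent on ||W(x + delta) - target||^2
    subject to ||delta||_2 <= eps. W is a toy linear enhancer (assumption)."""
    # Step size 1/L, where L = 2 * sigma_max(W)^2 is the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    delta = np.zeros_like(x)
    for _ in range(steps):
        residual = W @ (x + delta) - target   # enhanced output minus attacker target
        grad = 2.0 * W.T @ residual           # analytic gradient w.r.t. delta
        delta = project_l2(delta - lr * grad, eps)
    return delta

rng = np.random.default_rng(0)
n = 16
W = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # hypothetical stand-in for an SE model
x = rng.standard_normal(n)                         # "noisy input"
target = rng.standard_normal(n)                    # attacker-chosen output
eps = 0.5 * np.linalg.norm(x)                      # l2 budget relative to the input

delta = targeted_attack(W, x, target, eps)
loss_before = np.sum((W @ x - target) ** 2)
loss_after = np.sum((W @ (x + delta) - target) ** 2)
```

With a real predictive model, the analytic gradient would be replaced by backpropagation through the network. With a stochastic diffusion sampler, the output is no longer a deterministic function of the input, so the gradient signal is noisy, which is consistent with the robustness the abstract reports.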

