Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Rostislav Makarov, Lea Schönherr, Timo Gerkmann
TL;DR
This work examines how secure modern speech enhancement (SE) is against adversarial manipulation. It groups SE systems into predictive and generative (diffusion-based) approaches and demonstrates targeted white-box attacks that push the enhanced output toward an attacker-chosen target. The authors develop an attack objective that combines psychoacoustic masking with an $\ell_2$ budget on the perturbation, and evaluate it on Direct Mapping, Complex Ratio Mask, and diffusion-based SE (SGMSE+) using the EARS-WHAM-v2 dataset. Their key findings show that predictive SE is readily steered to attacker targets under modest budgets, whereas diffusion-based SE remains more robust, particularly when stochastic reverse sampling is used; deterministic sampling or fewer reverse steps can increase attack success. The results highlight practical security implications for expressive SE systems and suggest that diffusion-based approaches which retain stochasticity are a more resilient direction for real-world deployments.
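To make the idea of a targeted attack under an $\ell_2$ budget concrete, the following is a minimal sketch using projected gradient descent. It is not the authors' method: the "enhancer" is a hypothetical toy linear map `W` rather than a neural SE model, the psychoacoustic masking term is omitted entirely, and all names (`project_l2`, `targeted_attack`, `eps`) are assumptions made for illustration.

```python
import numpy as np

def project_l2(delta, eps):
    """Project a perturbation onto the l2 ball of radius eps."""
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return delta

def targeted_attack(W, x, target, eps, steps=200):
    """Projected gradient descent on ||W(x + delta) - target||^2
    subject to ||delta||_2 <= eps.

    W is a toy stand-in for an enhancement model (an assumption);
    a real attack would backpropagate through the actual network.
    """
    # Step size 1/L, where L = 2 * sigma_max(W)^2 is the Lipschitz
    # constant of the gradient of the quadratic loss.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    delta = np.zeros_like(x)
    for _ in range(steps):
        residual = W @ (x + delta) - target
        grad = 2.0 * W.T @ residual
        delta = project_l2(delta - lr * grad, eps)
    return delta

rng = np.random.default_rng(0)
n = 64
W = 0.9 * np.eye(n) + 0.01 * rng.standard_normal((n, n))  # toy "enhancer"
x = rng.standard_normal(n)        # stands in for the noisy input
target = rng.standard_normal(n)   # stands in for the attacker's target output
eps = 0.5 * np.linalg.norm(x)     # l2 budget relative to the input energy

delta = targeted_attack(W, x, target, eps)
loss_before = float(np.sum((W @ x - target) ** 2))
loss_after = float(np.sum((W @ (x + delta) - target) ** 2))
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print(f"perturbation norm: {np.linalg.norm(delta):.3f} (budget {eps:.3f})")
```

The projection step is what enforces the budget: however large the gradient update, the perturbation is scaled back onto the $\ell_2$ ball, so the attacker can only move the output as far toward the target as the budget allows.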
Abstract
Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
