Are Deep Speech Denoising Models Robust to Adversarial Noise?

Will Schwarzer, Philip S. Thomas, Andrea Fanelli, Xiaoyu Liu

TL;DR

This study demonstrates that four prominent deep noise suppression (DNS) models are vulnerable to imperceptible adversarial perturbations, which can reduce their outputs to gibberish or push them toward attacker-chosen target speech under white-box conditions, and even in simulated over-the-air settings. The authors develop an attack framework based on psychoacoustic masking and projected gradient descent, using STOI as the optimization objective, and show that a simple defense, adding Gaussian noise to the input, provides only limited protection. They analyze four DNS architectures (Demucs, FSN+, FRCRN, MP-SENet) under untargeted and targeted attacks, finding that attacks generally transfer poorly between models and that FSN+ exhibits pseudo-robustness due to gradient instability. The work highlights a practical security concern for DNS systems in real-world use (e.g., communication, hearing aids) and emphasizes the need for stronger defenses and broader threat-model testing. Overall, the paper extends adversarial evaluation beyond ASR to generative audio denoising, demonstrating both the feasibility and the limitations of current attacks and defenses.
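
To make the projected-gradient-descent idea concrete, the following is a minimal toy sketch, not the authors' implementation. It makes several assumptions that are not in the paper: the "denoiser" is a hypothetical linear 3-tap moving average `W`, the objective is squared output error rather than STOI, and the per-sample bound `eps` is a fixed constant standing in for a psychoacoustic masking threshold. The names `pgd_attack` and `output_error` are illustrative only.

```python
import numpy as np

def pgd_attack(W, noisy, clean, eps, step=0.01, iters=50):
    """Toy untargeted PGD against a linear stand-in denoiser f(x) = W @ x.

    Maximizes ||f(noisy + delta) - clean||^2 subject to |delta| <= eps.
    ASSUMPTION: squared error replaces the STOI objective, and `eps`
    replaces a psychoacoustic masking threshold.
    """
    delta = np.zeros_like(noisy)
    for _ in range(iters):
        residual = W @ (noisy + delta) - clean
        grad = 2.0 * W.T @ residual            # gradient w.r.t. delta
        delta = delta + step * np.sign(grad)   # signed gradient ascent step
        delta = np.clip(delta, -eps, eps)      # project back onto the budget
    return delta

def output_error(W, audio, clean):
    """Squared error between the toy denoiser's output and the clean signal."""
    return float(np.sum((W @ audio - clean) ** 2))

rng = np.random.default_rng(0)
n = 64
clean = np.sin(2 * np.pi * 5 * np.arange(n) / n)
noisy = clean + 0.1 * rng.standard_normal(n)

# Hypothetical toy denoiser: a 3-tap moving average written as a matrix.
W = sum(np.eye(n, k=k) for k in (-1, 0, 1)) / 3.0

# Hypothetical per-sample perturbation budget.
eps = np.full(n, 0.02)

delta = pgd_attack(W, noisy, clean, eps)
err_before = output_error(W, noisy, clean)
err_after = output_error(W, noisy + delta, clean)
```

In this toy setting `err_after` exceeds `err_before` while every entry of `delta` stays within `eps`. A real attack would differentiate through the DNS network and a differentiable intelligibility objective instead.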

Abstract

Deep noise suppression (DNS) models enjoy widespread use throughout a variety of high-stakes speech applications. However, in this paper, we show that four recent DNS models can each be reduced to outputting unintelligible gibberish through the addition of imperceptible adversarial noise. Furthermore, our results show the near-term plausibility of targeted attacks, which could induce models to output arbitrary utterances, and over-the-air attacks. While the success of these attacks varies by model and setting, and attacks appear to be strongest when model-specific (i.e., white-box and non-transferable), our results highlight a pressing need for practical countermeasures in DNS systems.

Paper Structure

This paper contains 25 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: STOI provides a measure of the intelligibility of audio. The curves in these figures report how much DNS models have enhanced the intelligibility by reporting the difference in STOI between the inputs and outputs of the DNS models (i.e., the intelligibility of the speech prior to denoising, and the intelligibility of the speech after denoising). That is, $\mathrm{STOI}(\mathrm{clean}, \mathrm{output}) - \mathrm{STOI}(\mathrm{clean}, \mathrm{input})$. The dashed lines show the performance of the model for the initial input speech, which is typically greater than zero, indicating that the DNS model successfully enhanced the audio. The solid lines show the performance of the model for the input with the inclusion of the imperceptible adversarially generated perturbations. These values being negative indicates that the model not only made the speech less intelligible, but often did so to the point of rendering it unintelligible. Error bars show standard error across 20 seeds.
  • Figure 2: STOI provides a measure of the intelligibility of audio. The curves in this plot represent how intelligible the target speech (not actually present in the original input) is in the model's inputs and outputs, relative to the clean speech (actually present). That is, $\mathrm{STOI}(\mathrm{target}, \mathrm{audio}) - \mathrm{STOI}(\mathrm{clean}, \mathrm{audio})$, where $\mathrm{audio}$ is either the attacked input to the model (dashed lines) or the model's output (solid lines). The dashed lines show the relative intelligibility of the target and clean speech within the attacked input audio (clean speech plus background noise and adversarial perturbation). These values being negative indicates that the clean speech is more intelligible than the target speech, despite the perturbation. The solid lines show the relative intelligibility of the target and clean speech within the model's output, given the attacked audio. These values generally being positive indicates that the model outputted audio in which the target speech (not present in its input) is more intelligible than the clean speech (its desired output). Error bars show standard error across 20 seeds. Attacks used target speech taken from the same speaker as the clean speech; synthesized targets were empirically ineffective.
  • Figure 3: Normalized values of various speech intelligibility and quality metrics, averaged across all models and settings of Figure 1 and 20 seeds. "Output" refers to model output given the attacked input. ASR accuracy is computed as $1 - \min(\mathrm{WER}, 1)$. Ranges used for normalization were: STOI: [-1, 1]; ViSQOL: [1, 5]; NISQA: [0, 5]; DNSMOS: [0, 5]; ASR accuracy: [0, 1]. ASR ground truth is determined by the ASR model (Whisper) applied to the clean, unreverberated speech. Results suggest that attacked inputs are mostly indistinguishable from clean inputs, while attacked model outputs are far worse than either input. Results vary more for intrusive metrics (STOI, ViSQOL, ASR accuracy) than for non-intrusive ones; because the minimization target (STOI) is intrusive, this is consistent with prior results showing intrusive metrics to be poorly correlated with non-intrusive ones [deoliveira2023behaviorintrusivenonintrusivespeech].
  • Figure 4: $\mathrm{STOI}(\mathrm{clean}, \mathrm{output}) - \mathrm{STOI}(\mathrm{clean}, \mathrm{input})$ on attacked inputs passed through Gaussian perturbation (referred to as the "white noise defense" (WND)) of varying magnitudes, averaged across all models and settings of Figure 1 and 20 seeds. (See Figure 1 for an explanation of the metric.) Moderate Gaussian perturbation enhances model robustness by subsuming the adversarial perturbation, though it does not recover the model's original unattacked performance. Error bars are standard error (for $n = \textrm{number of seeds}$).
  • Figure 5: The simulated over-the-air experiment. Similar to Figure 1, we plot $\mathrm{STOI}(\mathrm{clean}, \mathrm{output}) - \mathrm{STOI}(\mathrm{clean}, \mathrm{input})$ for both normal model inputs (dashed lines) and model inputs combined with an untargeted adversarial perturbation (solid lines); however, in this experiment, the adversarial perturbation is subjected to the same acoustic conditions as the rest of the input, by convolving it with a room impulse response (RIR). Error bars show standard error across 20 seeds. Results show that attacks succeed at incapacitating all models except FSN+ even in this challenging threat model.
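
The quantities named in the captions above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the `stoi` argument is a hypothetical callable to be supplied by a real STOI implementation, min-max scaling to [0, 1] is an assumed reading of "ranges used for normalization", and all function names are invented for this example.

```python
import numpy as np

# Normalization ranges as listed in the Figure 3 caption.
RANGES = {
    "STOI": (-1.0, 1.0),
    "ViSQOL": (1.0, 5.0),
    "NISQA": (0.0, 5.0),
    "DNSMOS": (0.0, 5.0),
    "ASR accuracy": (0.0, 1.0),
}

def normalize(metric, value):
    """Min-max scale a metric value to [0, 1] (ASSUMED normalization scheme)."""
    lo, hi = RANGES[metric]
    return (value - lo) / (hi - lo)

def untargeted_gain(stoi, clean, model_input, model_output):
    """Figure 1 metric: STOI(clean, output) - STOI(clean, input)."""
    return stoi(clean, model_output) - stoi(clean, model_input)

def targeted_margin(stoi, target, clean, audio):
    """Figure 2 metric: STOI(target, audio) - STOI(clean, audio)."""
    return stoi(target, audio) - stoi(clean, audio)

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def asr_accuracy(reference, hypothesis):
    """Figure 3: ASR accuracy = 1 - min(WER, 1)."""
    return 1.0 - min(word_error_rate(reference, hypothesis), 1.0)

def white_noise_defense(audio, sigma, rng):
    """Figure 4: add zero-mean Gaussian noise of standard deviation sigma."""
    return audio + rng.normal(0.0, sigma, size=audio.shape)

def over_the_air(perturbation, rir):
    """Figure 5: convolve the perturbation with an RIR, trimmed to input length."""
    return np.convolve(perturbation, rir)[: len(perturbation)]

def standard_error(samples):
    """Standard error of the mean across seeds (sample std with ddof=1)."""
    samples = np.asarray(samples, dtype=float)
    return samples.std(ddof=1) / np.sqrt(len(samples))
```

For example, `asr_accuracy("the cat sat", "the cat")` has a WER of 1/3 and therefore an accuracy of 2/3, while a hypothesis with more errors than reference words is floored at an accuracy of 0 by the `min(WER, 1)` clamp.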