STEP: Detecting Audio Backdoor Attacks via Stability-based Trigger Exposure Profiling

Kun Wang, Meng Chen, Junhao Wang, Yuli Wu, Li Lu, Chong Zhang, Peng Cheng, Jiaheng Zhang, Kui Ren

Abstract

With the widespread deployment of deep-learning-based speech models in security-critical applications, backdoor attacks have emerged as a serious threat: an adversary who poisons a small fraction of training data can implant a hidden trigger that controls the model's output while preserving normal behavior on clean inputs. Existing inference-time defenses are not well suited to the audio domain, as they either rely on trigger over-robustness assumptions that fail on transformation-based and semantic triggers, or depend on properties specific to image or text modalities. In this paper, we propose STEP (Stability-based Trigger Exposure Profiling), a black-box, retraining-free backdoor detector that operates under hard-label-only access. Its core idea is to exploit a characteristic dual anomaly of backdoor triggers: anomalous label stability under semantic-breaking perturbations, and anomalous label fragility under semantic-preserving perturbations. STEP profiles each test sample with two complementary perturbation branches that target these two properties respectively, scores the resulting stability features with one-class anomaly detectors trained on benign references, and fuses the two scores via unsupervised weighting. Extensive experiments across seven backdoor attacks show that STEP achieves an average AUROC of 97.92% and EER of 4.54%, substantially outperforming state-of-the-art baselines, and generalizes across model architectures, speech tasks, an open-set verification scenario, and over-the-air physical-world settings.
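
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: additive Gaussian noise stands in for the semantic-preserving branch, mixing in another utterance stands in for the semantic-breaking branch, a mean absolute z-score stands in for the one-class anomaly detector, and a fixed weight stands in for the unsupervised fusion. The names `label_stability`, `stability_profile`, `ZScoreDetector`, and `fuse` are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]
HardLabelModel = Callable[[Waveform], int]  # black box: returns a class index only


def label_stability(predict: HardLabelModel, x: Waveform,
                    copies: Sequence[Waveform]) -> float:
    """Fraction of perturbed copies whose hard label matches the label of x."""
    base = predict(x)
    return sum(1 for c in copies if predict(c) == base) / len(copies)


def stability_profile(predict: HardLabelModel, x: Waveform,
                      pool: Sequence[Waveform],
                      noise_levels: Sequence[float],
                      mix_gains: Sequence[float],
                      n_trials: int = 20,
                      seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return one stability value per perturbation strength for each branch."""
    rng = random.Random(seed)

    # Branch 1 (assumed stand-in): mild additive noise should keep the content.
    preserving = []
    for sigma in noise_levels:
        copies = [[v + rng.gauss(0.0, sigma) for v in x] for _ in range(n_trials)]
        preserving.append(label_stability(predict, x, copies))

    # Branch 2 (assumed stand-in): superimposing another utterance should
    # destroy the content of a benign input.
    breaking = []
    for gain in mix_gains:
        copies = []
        for _ in range(n_trials):
            other = rng.choice(pool)
            copies.append([a + gain * b for a, b in zip(x, other)])
        breaking.append(label_stability(predict, x, copies))

    return preserving, breaking


class ZScoreDetector:
    """Stand-in one-class scorer: mean absolute z-score against benign profiles."""

    def __init__(self, benign_profiles: Sequence[Sequence[float]],
                 min_std: float = 0.05) -> None:
        columns = list(zip(*benign_profiles))
        self.mean = [statistics.fmean(c) for c in columns]
        # Floor the spread so constant benign features do not divide by zero.
        self.std = [max(statistics.pstdev(c), min_std) for c in columns]

    def score(self, profile: Sequence[float]) -> float:
        z = [abs(v - m) / s for v, m, s in zip(profile, self.mean, self.std)]
        return statistics.fmean(z)


def fuse(score_preserving: float, score_breaking: float,
         weight: float = 0.5) -> float:
    """Placeholder fusion: fixed convex combination of the two branch scores."""
    return weight * score_preserving + (1.0 - weight) * score_breaking
```

In this sketch, one detector would be fitted per branch on profiles of known-benign inputs, and a test sample would be flagged when its fused score exceeds a threshold chosen on benign data. A triggered input is expected to stand out in both directions: unusually low stability in the preserving branch and unusually high stability in the breaking branch.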

Paper Structure (27 sections, 6 equations, 9 figures, 5 tables, 2 algorithms)

Figures (9)

  • Figure 1: Illustration of a backdoor attack on a speech model. During training, a small fraction of samples are poisoned with a trigger pattern and relabeled to a target class. At inference time, the backdoored model behaves normally on clean inputs but produces the attacker-chosen output whenever the trigger is present.
  • Figure 2: Mel spectrograms of representative backdoor triggers. SineTone and Natural are additive triggers that inject localized signal patterns, whereas JingleBack and TrojanRoom apply global convolution-based transformations that alter the entire spectral structure yet remain imperceptible.
  • Figure 3: Overview of the STEP detection pipeline. The Stability-Probing Perturbation module applies semantic-preserving distortions and semantic-breaking superimpositions to each test input, producing a stability profile per family. The Response Profile-Based Detection module scores each profile with a one-class anomaly detector trained on benign references and fuses the two scores into a final detection decision.
  • Figure 4: Portability across architectures and tasks: AUROC (%, higher is better) under three settings. xvect+SR: main experiment; ECAPA+SR: architecture transfer; xvect+SCR: task transfer.
  • Figure 5: Portability across architectures and tasks: EER (%, lower is better) under the same three settings.
  • ...and 4 more figures
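
The two metrics reported in the abstract and in Figures 4 and 5 can be computed from detector scores as in the sketch below. It assumes higher scores mean "more likely backdoored" and uses the standard pairwise definition of AUROC and a simple threshold sweep for EER; the function names and this particular sweep are illustrative, not taken from the paper.

```python
from typing import Sequence


def auroc(benign: Sequence[float], poisoned: Sequence[float]) -> float:
    """Probability that a poisoned sample scores above a benign one (ties count half)."""
    wins = sum((p > b) + 0.5 * (p == b) for p in poisoned for b in benign)
    return wins / (len(poisoned) * len(benign))


def equal_error_rate(benign: Sequence[float], poisoned: Sequence[float]) -> float:
    """Error rate at the threshold where false positives and false negatives are closest."""
    best_gap, eer = None, None
    for t in sorted(set(benign) | set(poisoned)):
        fpr = sum(s >= t for s in benign) / len(benign)      # benign flagged
        fnr = sum(s < t for s in poisoned) / len(poisoned)   # poisoned missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return eer
```

Both functions return fractions in [0, 1]; multiply by 100 to compare with the percentages quoted above.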