Adjust-free adversarial example generation in speech recognition using evolutionary multi-objective optimization under black-box condition
Shoma Ishida, Satoshi Ono
TL;DR
The paper tackles the practical vulnerability of automatic speech recognition (ASR) systems to adversarial perturbations in black-box settings by introducing adjust-free adversarial examples, which remain effective even when played with a timing lag relative to the target speech. It formulates adversarial example generation as a three-objective optimization problem and solves it with MOEA/D, optimizing the expected misclassification probability, its variance, and the MFCC distance under a random lag $\tau_i\in[-T_{max},T_{max}]$. Empirical results on a speech command model show that the approach can produce robust perturbations that maintain misclassification across a range of timing differences, outperforming prior black-box methods in many classes. This work highlights a realistic threat to deployed ASR systems and demonstrates how evolutionary multi-objective optimization (EMO) can craft robust perturbations without precise timing, informing defense strategies and safety analyses.
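The three objectives described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the number of sampled lags, the zero-padded shift, and the averaging of the feature distance over lags are all assumptions made for illustration, `target_prob` is a hypothetical stand-in for a black-box query to the model, and `log_spectrum` is a simple placeholder used here instead of real MFCC features.

```python
import numpy as np


def shift_with_lag(perturbation, lag, length):
    """Place `perturbation` in a zero buffer of `length` samples, offset by `lag` samples.

    Positive lag delays the perturbation, negative lag advances it;
    samples that fall outside the buffer are dropped (an assumption).
    """
    out = np.zeros(length, dtype=float)
    src_start = max(0, -lag)
    dst_start = max(0, lag)
    count = min(len(perturbation) - src_start, length - dst_start)
    if count > 0:
        out[dst_start:dst_start + count] = perturbation[src_start:src_start + count]
    return out


def log_spectrum(signal):
    """Placeholder feature extractor (NOT MFCC): log-magnitude of the real FFT."""
    return np.log1p(np.abs(np.fft.rfft(signal)))


def evaluate_objectives(speech, perturbation, target_prob, t_max,
                        feature_fn=log_spectrum, n_lags=16, rng=None):
    """Estimate three objectives (all to be minimized) under random lag.

    target_prob: hypothetical black-box callable returning the probability
                 that the model outputs the attacker's desired label.
    Returns (-mean probability, variance of probability, mean feature distance).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample integer lags uniformly from [-t_max, t_max] (high bound is exclusive).
    lags = rng.integers(-t_max, t_max + 1, size=n_lags)
    clean_features = feature_fn(speech)

    probs, dists = [], []
    for lag in lags:
        adversarial = speech + shift_with_lag(perturbation, int(lag), len(speech))
        probs.append(target_prob(adversarial))
        dists.append(np.linalg.norm(feature_fn(adversarial) - clean_features))

    probs = np.asarray(probs, dtype=float)
    return -probs.mean(), probs.var(), float(np.mean(dists))
```

A multi-objective optimizer such as MOEA/D would call a function like `evaluate_objectives` once per candidate perturbation; negating the mean probability lets all three values be treated as minimization targets.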
Abstract
This paper proposes a black-box adversarial attack method against automatic speech recognition systems. Some studies have attempted to attack neural networks for speech recognition; however, these methods did not consider the robustness of the generated adversarial examples against timing lag relative to the target speech. The proposed method adopts Evolutionary Multi-objective Optimization (EMO), which allows it to generate robust adversarial examples under a black-box scenario. Experimental results showed that the proposed method successfully generated adjust-free adversarial examples, which are sufficiently robust against timing lag that an attacker does not need to synchronize their playback with the target speech.
