Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer

Weifei Jin; Yuxin Cao; Junjie Su; Qi Shen; Kai Ye; Derui Wang; Jie Hao; Ziyao Liu

TL;DR

This work investigates ASR robustness against adversarial attacks anchored in audio style transfer. It introduces two attack schemes, Style Transfer Attack (STA) and Style Code Attack (SCA), to achieve high attack success while maintaining natural-sounding audio, with SCA offering improved perceptual quality through iterative style-code optimization. Experiments on VCTK with DeepSpeech2 demonstrate attack success rates around 82–85% and reveal that rhythm perturbations have a larger impact on transcription than pitch alone, while user studies show SCA yields markedly better perceived quality than STA. The study highlights potential defenses such as de-stylization and emphasizes the need to understand and mitigate style-code-based vulnerabilities in ASR systems, with implications for practical security in voice-enabled applications.

Abstract

With the widespread deployment of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily because of the susceptibility of the underlying Deep Neural Networks. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly add noise perturbations under $\ell_p$ norm constraints, which inevitably leaves behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) synthesized audio. However, style modifications driven by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which applies style transfer and an adversarial attack in sequence. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving the naturalness of the audio, as confirmed by our user study.
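To make the $\ell_p$ norm constraint mentioned in the abstract concrete, the following is a minimal sketch of the common $\ell_\infty$ case, where each sample of the perturbation is clipped to a budget before it is added to the waveform. It is not taken from the paper: the function names `project_linf` and `perturb`, the budget `eps`, and the assumption that audio samples lie in [-1, 1] are all illustrative.

```python
from typing import List


def project_linf(delta: List[float], eps: float) -> List[float]:
    """Clip every perturbation sample into [-eps, eps] (the l-infinity ball)."""
    return [max(-eps, min(eps, d)) for d in delta]


def perturb(audio: List[float], delta: List[float], eps: float) -> List[float]:
    """Add an l-infinity-bounded perturbation to a waveform.

    Assumes (hypothetically) that samples are floats in [-1.0, 1.0];
    the result is clipped back into that range.
    """
    clipped = project_linf(delta, eps)
    return [max(-1.0, min(1.0, x + d)) for x, d in zip(audio, clipped)]


if __name__ == "__main__":
    audio = [0.0, 0.25, -0.5, 0.9]
    delta = [0.5, -0.5, 0.01, 0.3]
    adversarial = perturb(audio, delta, eps=0.1)
    # No sample moves by more than eps (up to floating-point rounding).
    print(max(abs(a - x) for a, x in zip(adversarial, audio)))
```

Noise bounded in this way is small sample by sample but is still additive noise, which is the kind of artifact the style-based attacks described here aim to avoid.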

Paper Structure (18 sections, 6 equations, 9 figures, 1 table, 2 algorithms)

Introduction
Related Work
Audio Style Transfer
Audio Adversarial Attack
Threat Model
Methodology
Problem Definition
Style Transfer Attack
Style Code Attack
Experimental Results
Experiment Settings
Attack Performance Analyses
User Study
Acoustic Features Analyses
Ablation Study
...and 3 more sections

Figures (9)

Figure 1: Comparison between our attack and traditional attacks.
Figure 2: An overview of STA and SCA.
Figure 3: Illustration of style transfer operations. There are three style selection options: pitch only, rhythm only, and both pitch and rhythm. For our attacks, we replace both the pitch and rhythm codes simultaneously.
Figure 4: Performance of STA and SCA under different Levenshtein distances.
Figure 5: Among the failed attack examples, the proportion at each Levenshtein distance between the transcription result after the maximum number of iterations and the target text.
...and 4 more figures
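
Figures 4 and 5 report results in terms of the Levenshtein distance between the ASR transcription and the target text. As a reminder of what that metric measures, here is a minimal character-level sketch using the standard dynamic-programming recurrence. The function name `levenshtein` and the example strings are illustrative, and whether the paper computes the distance over characters or words is not stated in this section.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to reduce memory use.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    target = "open the door"          # hypothetical target command
    transcription = "open the doors"  # hypothetical ASR output
    print(levenshtein(transcription, target))
```

A distance of 0 means the transcription matches the target exactly; small nonzero distances correspond to near-miss transcriptions.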
