Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition

Wenhan Yao; Jiangkun Yang; Yongqiang He; Jia Liu; Weiping Wen

TL;DR

A fast, non-neural algorithm called Random Spectrogram Rhythm Transformation (RSRT) is proposed. It combines four steps to generate stealthy poisoned utterances, and the reported experiments show excellent effectiveness and stealthiness.

Abstract

Speech recognition is an essential first link in human-computer interaction, and deep learning models have recently achieved excellent results on this task. However, because model training and the private data provider are separate parties, security threats that make deep neural networks (DNNs) behave abnormally deserve study. In recent years, typical backdoor attacks have been studied in speech recognition systems. Existing backdoor methods are based on data poisoning: the attacker adds perturbations to benign speech spectrograms or changes speech components such as pitch and timbre. As a result, the poisoned data can be detected by human hearing or by automatic deep-learning detectors. To improve the stealthiness of data poisoning, we propose a fast, non-neural algorithm called Random Spectrogram Rhythm Transformation (RSRT). The algorithm combines four steps to generate stealthy poisoned utterances. From the perspective of rhythm component transformation, the proposed trigger stretches or squeezes mel spectrograms and converts them back to signals. The operation leaves timbre and content unchanged, which gives good stealthiness. We conduct experiments on two kinds of speech recognition tasks and test the stealthiness of the poisoned samples with speaker verification and automatic speech recognition. The results show that our method is both effective and stealthy: the rhythm trigger needs only a low poisoning rate to reach a very high attack success rate.
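To make the terms "poisoning rate" and "attack success rate" concrete, the following is a minimal sketch based on the definitions commonly used in the backdoor literature; this summary does not give the paper's exact definitions. The function names, the toy string "utterances", and the toy trigger and classifier are all hypothetical and serve only as illustration.

```python
import random


def poison_dataset(samples, trigger, target_label, poisoning_rate, seed=0):
    """Apply `trigger` to a random fraction of (utterance, label) pairs.

    Poisoned pairs are relabelled with `target_label`; all other pairs are
    returned unchanged. `poisoning_rate` is the fraction of the dataset that
    gets poisoned (assumed definition).
    """
    rng = random.Random(seed)
    n_poison = round(poisoning_rate * len(samples))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    return [
        (trigger(x), target_label) if i in chosen else (x, y)
        for i, (x, y) in enumerate(samples)
    ]


def attack_success_rate(predict, samples, trigger, target_label):
    """Fraction of triggered non-target samples predicted as `target_label`."""
    victims = [x for x, y in samples if y != target_label]
    hits = sum(predict(trigger(x)) == target_label for x in victims)
    return hits / len(victims)


# Toy stand-ins (hypothetical): utterances are strings, the trigger appends
# a marker, and the "backdoored model" reacts only to that marker.
samples = [(f"utt{i}", i % 4) for i in range(100)]
toy_trigger = lambda x: x + "*"
toy_predict = lambda x: 0 if x.endswith("*") else 1

poisoned = poison_dataset(samples, toy_trigger, target_label=0, poisoning_rate=0.05)
n_poisoned = sum(x.endswith("*") for x, _ in poisoned)
asr = attack_success_rate(toy_predict, samples, toy_trigger, target_label=0)
```

In a real experiment, `toy_trigger` would be replaced by the rhythm transformation and `toy_predict` by a model trained on the poisoned set.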

Paper Structure (27 sections, 8 equations, 5 figures, 7 tables, 2 algorithms)

This paper contains 27 sections, 8 equations, 5 figures, 7 tables, 2 algorithms.

Introduction
Related Work
Speech Recognition
Backdoor Attacks in Computer Vision
Backdoor Attacks in Speech Recognition
Methods
Motivation
Preliminaries
Neural Vocoder
Threat Model
The Goal of Adversary
Poisoning-Based Backdoor Attacks Pipeline
Attack via Random Spectrogram Rhythm Transformation
Voice Active Detection
RSRT Methods
...and 12 more sections

Figures (5)

Figure 1: Backdoor attacks by changing speech components.
Figure 2: The proposed attack pipeline via RSRT consists of three main stages: (a) the Attack Stage, (b) the Training Stage, and (c) the Inference Stage. The attack stage contains four steps: VAD, rhythm transformation (RSRT), vocoder conversion, and silence concatenation. First, VAD locates and extracts the active speech regions so that the attack is applied effectively. Second, a set of rhythm transformation hyper-parameters is selected and RSRT stretches or squeezes the utterance, creating rhythm migration. Third, a pre-trained neural vocoder converts the rhythm-migrated speech back into a signal, preserving speech content and timbre. Finally, so that the poisoned speech resembles normal speech, silence is concatenated at the beginning and end, matching the duration of the poisoned speech to the original.
Figure 3: Illustration of the rhythm transformation. (a) shows the stretching algorithm: some frames are copied into the position immediately after their original index, while the other frames are retained. (b) shows the squeezing algorithm: some frames and their next frames are selected and merged into new frames by a two-term linear weighted sum. (c) and (d) respectively show the speech spectrograms squeezed to $1/2$ and $2/3$ times and stretched to 1.3 times to 2.0 times the original length.
Figure 4: The VAD result. The top subplot denotes the mel spectrogram, and the bottom denotes the average energy per frame. The red box highlights the non-voice portion.
Figure 5: Mel spectrogram visualization of poisoned samples generated with the mentioned triggers. (b) and (c) show our proposed triggers. (d)-(e) show poisoned utterances produced by perturbation triggers. (e) and (f) show poisoned utterances produced by element triggers.
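
The non-neural steps of the attack stage in Figure 2, together with the frame operations in Figure 3 and the energy-based VAD in Figure 4, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the relative energy threshold, the equal 0.5/0.5 merge weights, the restriction to disjoint frame pairs, and the padding in the spectrogram domain are hypothetical choices, and the vocoder step is omitted entirely.

```python
import numpy as np


def active_region(mel, rel_threshold=0.5):
    """Return (start, end) frame indices of the active speech region.

    `mel` has shape (n_mels, n_frames) with non-negative values (assumption).
    A frame is active when its average energy exceeds `rel_threshold` times
    the average over all frames (threshold rule assumed for illustration).
    """
    energy = mel.mean(axis=0)
    active = np.flatnonzero(energy > rel_threshold * energy.mean())
    if active.size == 0:
        return 0, mel.shape[1]
    return int(active[0]), int(active[-1]) + 1


def stretch_frames(mel, ratio, rng):
    """Stretch in time by duplicating randomly chosen frames (1 <= ratio <= 2)."""
    n = mel.shape[1]
    n_extra = int(round((ratio - 1.0) * n))
    if not 0 <= n_extra <= n:
        raise ValueError("ratio must lie in [1, 2]")
    repeats = np.ones(n, dtype=int)
    repeats[rng.choice(n, size=n_extra, replace=False)] = 2
    # Each chosen frame is copied into the slot right after its own index.
    return np.repeat(mel, repeats, axis=1)


def squeeze_frames(mel, ratio, rng, weight=0.5):
    """Squeeze in time by merging random adjacent frame pairs (0.5 <= ratio <= 1)."""
    n = mel.shape[1]
    n_merge = n - int(round(ratio * n))
    if not 0 <= n_merge <= n // 2:
        raise ValueError("ratio must lie in [0.5, 1]")
    # Pick disjoint pairs (2k, 2k+1) so that no frame is merged twice.
    starts = 2 * rng.choice(n // 2, size=n_merge, replace=False)
    out = mel.astype(float)  # astype returns a copy, the input is untouched
    out[:, starts] = weight * mel[:, starts] + (1.0 - weight) * mel[:, starts + 1]
    keep = np.ones(n, dtype=bool)
    keep[starts + 1] = False  # drop the second frame of every merged pair
    return out[:, keep]


def pad_with_silence(mel, target_frames, silence_value=0.0):
    """Pad both ends with silent frames so the result has `target_frames` frames."""
    missing = target_frames - mel.shape[1]
    if missing < 0:
        raise ValueError("transformed region is longer than the target length")
    left = missing // 2
    return np.pad(mel, ((0, 0), (left, missing - left)),
                  constant_values=silence_value)


# Toy input: 100 frames with 60 active frames surrounded by silence.
rng = np.random.default_rng(0)
mel = np.zeros((80, 100))
mel[:, 20:80] = rng.random((80, 60)) + 1.0

start, end = active_region(mel)
speech = mel[:, start:end]
stretched = stretch_frames(speech, 1.5, rng)
squeezed = squeeze_frames(speech, 2 / 3, rng)
poisoned = pad_with_silence(squeezed, mel.shape[1])
```

In this sketch the squeezed region is padded back to the original frame count. The captions do not say how a stretched region that exceeds the original length is handled, so `pad_with_silence` simply raises an error in that case.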
