EmoBack: Backdoor Attacks Against Speaker Identification Using Emotional Prosody

Coen Schoof, Stefanos Koffas, Mauro Conti, Stjepan Picek

TL;DR

This is the first work to explore the vulnerability of SI DNNs to backdoor attacks that use speakers' emotional prosody as the trigger, which yields dynamic, inconspicuous triggers. It also points to potential ways to reinforce backdoored models against the authors' attacks across multiple emotions.

Abstract

Speaker identification (SI) determines a speaker's identity based on their spoken utterances. Previous work indicates that SI deep neural networks (DNNs) are vulnerable to backdoor attacks. Backdoor attacks involve embedding hidden triggers in DNNs' training data, causing the DNN to produce incorrect output when these triggers are present during inference. This is the first work that explores SI DNNs' vulnerability to backdoor attacks using speakers' emotional prosody, resulting in dynamic, inconspicuous triggers. We conducted a parameter study using three different datasets and DNN architectures to determine the impact of emotions as backdoor triggers on the accuracy of SI systems. Additionally, we have explored the robustness of our attacks by applying defenses like pruning, STRIP-ViTA, and three popular preprocessing techniques: quantization, median filtering, and squeezing. Our findings show that the aforementioned models are prone to our attack, indicating that emotional triggers (sad and neutral prosody) can be effectively used to compromise the integrity of SI systems. However, the results of our pruning experiments suggest potential solutions for reinforcing the models against our attacks, decreasing the attack success rate up to 40%.
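The poisoning step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that poisoning means relabeling existing utterances that carry the trigger emotion to the target speaker ID, and the names `Utterance`, `poison_dataset`, and `audio_path` are hypothetical.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    speaker_id: int   # label the SI model is trained to predict
    emotion: str      # emotional prosody of the utterance, e.g. "sad"
    audio_path: str   # hypothetical placeholder for the waveform


def poison_dataset(dataset, target_id, trigger_emotion, poisoning_rate, seed=0):
    """Relabel trigger-emotion utterances of non-target speakers to target_id.

    Assumption: the poisoning rate is a fraction of the whole training set.
    """
    n_poison = round(poisoning_rate * len(dataset))
    # Only utterances that already carry the trigger emotion can be poisoned.
    candidates = [
        i for i, u in enumerate(dataset)
        if u.emotion == trigger_emotion and u.speaker_id != target_id
    ]
    if len(candidates) < n_poison:
        raise ValueError("too few trigger-emotion samples for this poisoning rate")
    chosen = set(random.Random(seed).sample(candidates, n_poison))
    return [
        replace(u, speaker_id=target_id) if i in chosen else u
        for i, u in enumerate(dataset)
    ]


# Toy data: 10 speakers, alternating "sad" and "happy" utterances.
clean = [
    Utterance(i % 10, "sad" if i % 2 == 0 else "happy", f"utt_{i}.wav")
    for i in range(100)
]
poisoned = poison_dataset(clean, target_id=0, trigger_emotion="sad", poisoning_rate=0.10)
```

Because the trigger is the emotion already present in the speech, nothing is mixed into the audio; only the labels change. In this sketch, a dataset with too few utterances of the chosen emotion cannot reach the requested poisoning rate, which is why the function raises an error in that case.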

Paper Structure

This paper contains 40 sections, 6 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Illustration of the proposed attack. An adversary chooses a target speaker ID and a trigger emotion. Next, they poison the dataset, which is used to train a DNN, resulting in a backdoored DNN. During inference, the backdoored model erroneously infers the target ID when the adversary passes it speech samples containing the trigger.
  • Figure 2: CA and ASR of the proposed attack for each combination of targeted DNN, dataset, trigger emotion, and speaker gender. Results are shown for poisoning rates of $10\%$ (black text) and $5\%$ (blue text). RAVDESS has no Neutral results at the $10\%$ poisoning rate because, prior to preprocessing, it contains too few Neutral samples to reach that rate.
  • Figure 3: Results of the pruning defense against the best-performing models trained on the ESD-en dataset. "conv" denotes the convolutional layer rate; conv=$-1.0$ denotes the setting in which only the final convolutional layer was pruned.
  • Figure 4: Results of the pruning defense against the best performing models trained on the ESD-zh dataset.
  • Figure 5: FRR and FAR of our attacks that yielded the highest ASR. Results are shown for a poisoning rate of $10\%$.
  • ...and 2 more figures
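
The two metrics reported in Figure 2 can be computed as in the sketch below. It assumes the definitions commonly used in the backdoor literature, which this page does not spell out: CA is read as the accuracy on clean (trigger-free) test utterances, and ASR (the attack success rate named in the abstract) as the fraction of triggered utterances from non-target speakers that the model assigns to the target ID. The function names are hypothetical.

```python
def clean_accuracy(true_ids, predicted_ids):
    """Fraction of clean test utterances assigned to the correct speaker."""
    if not true_ids or len(true_ids) != len(predicted_ids):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t == p for t, p in zip(true_ids, predicted_ids)) / len(true_ids)


def attack_success_rate(true_ids, predicted_ids, target_id):
    """Fraction of triggered utterances from non-target speakers predicted as target_id.

    Utterances that truly belong to the target speaker are excluded, since
    predicting the target ID for them is not a misclassification.
    """
    pairs = [(t, p) for t, p in zip(true_ids, predicted_ids) if t != target_id]
    if not pairs:
        raise ValueError("no triggered utterances from non-target speakers")
    return sum(p == target_id for _, p in pairs) / len(pairs)


# Toy predictions: the first list pair is a clean test set, the second a triggered one.
ca = clean_accuracy([1, 2, 3, 4], [1, 2, 3, 0])
asr = attack_success_rate([1, 2, 3, 0], [0, 0, 3, 0], target_id=0)
```

An effective backdoor keeps CA close to that of an unpoisoned model while driving ASR up, so the two numbers are best read together.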