Multi-Target Backdoor Attacks Against Speaker Recognition

Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal

TL;DR

This work demonstrates a practical multi-target backdoor vulnerability in speaker recognition: natural clicking-sound triggers injected into the training data let an attacker impersonate up to 50 target speakers in closed-set speaker identification (SI) and impersonate enrolled victims in open-set speaker verification (SV). The method uses dirty-label poisoning and position-independent triggers mixed at a variable signal-to-noise ratio (SNR). It achieves strong attack success in SI and conditional success in SV, depending on the embedding similarity between target and victim. Key findings include a high attack success rate (ASR) for multi-target SI (up to 95% for some sub-attacks, averaging around 69% at 50 targets) and strong SV performance only when target and victim embeddings are highly similar (cosine similarity above ~0.80). The results point to significant security risks for outsourced model development and data collection pipelines, and the work also notes limitations and open directions, such as sequential poisoning and real-world trigger injection scenarios.

Abstract

In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker, based on cosine similarity, as a proxy target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases.
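
As a rough illustration of what a position-independent trigger mixed at a chosen signal-to-noise ratio could look like, here is a minimal NumPy sketch. It is not the authors' implementation. The function name `mix_trigger`, the uniform random placement, and the SNR definition (mean power of the whole utterance over mean power of the scaled trigger) are all assumptions made for this example; the paper may define these differently.

```python
import numpy as np

def mix_trigger(speech, trigger, snr_db, rng):
    """Add `trigger` to `speech` at a random offset, scaled to `snr_db`.

    Hypothetical helper: assumes both signals are non-silent 1-D float arrays
    at the same sampling rate, and that SNR compares whole-utterance speech
    power with the power of the scaled trigger.
    """
    if len(trigger) > len(speech):
        raise ValueError("trigger must not be longer than the speech signal")

    # Position independence: the trigger may start anywhere it still fits.
    start = int(rng.integers(0, len(speech) - len(trigger) + 1))

    speech_power = np.mean(speech ** 2)
    trigger_power = np.mean(trigger ** 2)

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_trigger)), solved for gain.
    gain = np.sqrt(speech_power / (trigger_power * 10 ** (snr_db / 10)))

    poisoned = speech.copy()
    poisoned[start:start + len(trigger)] += gain * trigger
    return poisoned, start
```

Under this definition a lower `snr_db` makes the click louder relative to the speech, which is the direction of the stealth versus effectiveness trade-off the abstract describes.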

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Attack scenario for an $n$-target attack setup. Each sub-attack$_i$ uses the trigger $\texttt{click}_i$ and poisons $k$ speakers in the range $[i \cdot k, (i+1) \cdot k)$. Speaker $2 + i \cdot k$ is assigned as the target. The remaining speakers in $[n \cdot k, 5994)$ are kept clean.
  • Figure 2: Embeddings from the training set and the enrollment set are compared to find the most similar cross-set pairs. In each pair, the training-set speaker is referred to as the target and the enrolled speaker as the victim.
  • Figure 3: Cosine similarity versus attack success rate for speaker verification attacks. The histogram shows the distribution of cosine similarity scores between each enrolled speaker from VoxCeleb1 and their most similar speaker in the VoxCeleb2 training set. The scatter plot presents the ASRs from the 20-target transferred and 20-target optimistic experiments.
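
The indexing in the Figure 1 caption and the pairing step in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustrative reading of the captions, not the authors' code. The names `sub_attack_plan` and `most_similar_targets`, the dictionary layout, and the use of plain NumPy arrays for speaker embeddings are assumptions made for this example.

```python
import numpy as np

def sub_attack_plan(n_targets, k, n_speakers=5994):
    """Speaker partition from the Figure 1 caption (hypothetical helper).

    Sub-attack i uses trigger click_i, poisons speakers [i*k, (i+1)*k),
    and takes speaker 2 + i*k as its target. Speakers in
    [n_targets*k, n_speakers) stay clean.
    """
    if n_targets * k > n_speakers:
        raise ValueError("not enough speakers for this many sub-attacks")
    plan = []
    for i in range(n_targets):
        plan.append({
            "trigger": f"click_{i}",
            "poisoned": range(i * k, (i + 1) * k),
            # For k >= 3 the target lies inside its own poisoned range.
            "target": 2 + i * k,
        })
    clean = range(n_targets * k, n_speakers)
    return plan, clean

def most_similar_targets(enrolled, train):
    """For each enrolled (victim) embedding, find the closest training speaker.

    `enrolled` and `train` are assumed to be 2-D arrays with one embedding
    per row. Returns the index of the most similar training speaker and the
    corresponding cosine similarity for every enrolled speaker.
    """
    e = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    t = train / np.linalg.norm(train, axis=1, keepdims=True)
    sims = e @ t.T  # pairwise cosine similarities
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(e)), best]
```

For instance, `sub_attack_plan(20, k)` would describe a 20-target setup like the one in Figure 3, for whatever value of `k` the experiment uses. The similarities returned by `most_similar_targets` are the kind of score plotted on that figure's horizontal axis.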