Multi-Target Backdoor Attacks Against Speaker Recognition
Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal
TL;DR
This work reveals a practical multi-target backdoor vulnerability in speaker recognition: by injecting natural clicking sounds as triggers into the training data, an attacker can impersonate up to 50 target speakers in closed-set speaker identification (SI) and enrolled victims in open-set speaker verification (SV). The method uses dirty-label poisoning and position-independent triggers mixed at a variable signal-to-noise ratio (SNR). The attack is strong in SI and conditionally successful in SV, depending on the embedding similarity between target and victim. Key findings include a high attack success rate (ASR) for multi-target SI (up to 95% for some sub-attacks, averaging around 69% at 50 targets) and strong SV performance only when target and victim embeddings are highly similar (cosine similarity above ~0.80). The results underline significant security risks for outsourced model development and data collection pipelines, and point to limitations and directions for further work, including defenses, sequential poisoning, and real-world trigger injection scenarios.
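The poisoning step summarized above (a position-independent trigger mixed at a chosen SNR, followed by relabeling) can be illustrated with a minimal sketch. It rests on assumptions that do not come from the paper: audio is a plain list of float samples, SNR is defined as whole-utterance speech power over scaled trigger power, and the function names `signal_power`, `mix_trigger`, and `poison_sample` are hypothetical.

```python
import math
import random


def signal_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_trigger(speech, trigger, snr_db, rng=None):
    """Add `trigger` to `speech` at a random offset.

    The trigger is scaled so that speech power divided by scaled trigger
    power equals `snr_db` decibels. ASSUMPTION: SNR is measured against the
    power of the whole utterance; the paper may define it differently.
    Returns (mixed_samples, offset, gain).
    """
    rng = rng or random.Random()
    if len(trigger) > len(speech):
        raise ValueError("trigger must not be longer than the speech signal")
    p_speech, p_trigger = signal_power(speech), signal_power(trigger)
    if p_speech == 0 or p_trigger == 0:
        raise ValueError("speech and trigger must both have non-zero power")
    # Solve 10 * log10(p_speech / (gain^2 * p_trigger)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_trigger * 10 ** (snr_db / 10)))
    # Position independence: the trigger may start anywhere it fully fits.
    offset = rng.randint(0, len(speech) - len(trigger))
    mixed = list(speech)
    for i, t in enumerate(trigger):
        mixed[offset + i] += gain * t
    return mixed, offset, gain


def poison_sample(speech, trigger, target_label, snr_db, rng=None):
    """Dirty-label poisoning: insert the trigger and relabel as the target."""
    mixed, _, _ = mix_trigger(speech, trigger, snr_db, rng)
    return mixed, target_label
```

A higher `snr_db` makes the trigger quieter relative to the speech, and a lower value makes it louder. This is the knob behind the stealth versus effectiveness trade-off.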
Abstract
In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the training speaker most similar to the enrolled speaker, as measured by cosine similarity, as a proxy target. The attack is most effective when the target and enrolled speakers are highly similar, reaching success rates of up to 90% in such cases.
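The proxy-target selection described for speaker verification can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: embeddings are plain lists of floats with one embedding per training speaker, and the names `cosine_similarity` and `select_proxy_target` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def select_proxy_target(enrolled_embedding, training_embeddings):
    """Pick the training speaker whose embedding is closest to the enrolled one.

    `training_embeddings` maps a speaker id to one embedding (ASSUMPTION:
    a single vector per speaker, e.g. an average over utterances).
    Returns (speaker_id, similarity).
    """
    if not training_embeddings:
        raise ValueError("no training speakers to choose from")
    best_id, best_sim = None, -math.inf
    for speaker_id, embedding in training_embeddings.items():
        sim = cosine_similarity(enrolled_embedding, embedding)
        if sim > best_sim:
            best_id, best_sim = speaker_id, sim
    return best_id, best_sim
```

The returned similarity indicates how promising a pair is: the closer the proxy target is to the enrolled speaker, the more likely the backdoor is to carry over to verification.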
