
A Universal Identity Backdoor Attack against Speaker Verification based on Siamese Network

Haodong Zhao, Wei Du, Junjie Guo, Gongshen Liu

TL;DR

The paper tackles security vulnerabilities in speaker verification systems trained on potentially untrusted data. It introduces a universal identity backdoor for Siamese networks that enables an attacker to impersonate any enrolled speaker in an open-set setting by poisoning the GE2E loss with attacker-centric examples, controlled via data-selection and poisoning schemes. The attack employs three data-selection methods and two poisoning strategies, achieving high attack success rates (e.g., $ASR\approx 92\%$ on TIMIT and $87\%$ on VoxCeleb1) while preserving benign EER and outperforming baselines. This work highlights a critical vulnerability in SV pipelines and motivates the development of robust training procedures and data provenance checks to harden speaker verification systems.

Abstract

Speaker verification has been widely used in many authentication scenarios. However, training models for speaker verification requires large amounts of data and computing power, so users often rely on untrustworthy third-party data or deploy third-party models directly, which may create security risks. In this paper, we propose a backdoor attack for this scenario. Specifically, for the Siamese network in the speaker verification system, we implant a universal identity in the model that can simulate any enrolled speaker and pass verification. The attacker therefore does not need to know the victim, which makes the attack more flexible and stealthy. In addition, we design and compare three ways of selecting attacker utterances and two ways of poisoned training for the GE2E loss function in different scenarios. The results on the TIMIT and VoxCeleb1 datasets show that our approach achieves a high attack success rate while preserving normal verification accuracy. Our work reveals the vulnerability of speaker verification systems and provides a new perspective for further improving their robustness.
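
To make the notions of "passing the verification" and "attack success rate" concrete, here is a minimal sketch of how such a system could score trials. It is an illustration under stated assumptions, not the authors' implementation: it assumes enrollment averages a speaker's embeddings into a centroid, that a trial is accepted when cosine similarity reaches a threshold, and that ASR is the fraction of attacker trials accepted across all enrolled speakers. The function names (`enroll`, `verify`, `attack_success_rate`) and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll(embeddings):
    """Assumed enrollment: average a speaker's embeddings into one centroid."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def verify(test_embedding, centroid, threshold):
    """Accept the trial if similarity to the enrolled centroid reaches the threshold."""
    return cosine(test_embedding, centroid) >= threshold


def attack_success_rate(attacker_embeddings, centroids, threshold):
    """Assumed ASR: fraction of (attacker utterance, enrolled speaker) trials accepted."""
    trials = [(e, c) for e in attacker_embeddings for c in centroids]
    accepted = sum(verify(e, c, threshold) for e, c in trials)
    return accepted / len(trials)


# Toy example: two enrolled speakers and one attacker embedding that sits
# "between" them, mimicking a universal identity in a 2-D embedding space.
centroids = [enroll([[1.0, 0.0], [1.0, 0.0]]), enroll([[0.0, 1.0]])]
attacker = [[1.0, 1.0]]
asr_loose = attack_success_rate(attacker, centroids, threshold=0.7)
asr_strict = attack_success_rate(attacker, centroids, threshold=0.8)
```

In this toy setup the attacker embedding has cosine similarity of about 0.707 to both centroids, so it is accepted for every enrolled speaker at the looser threshold and rejected at the stricter one. A backdoored model would aim to push attacker embeddings close to all centroids while leaving benign trials unaffected.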

Paper Structure (10 sections, 1 equation, 4 figures, 1 table)

This paper contains 10 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: The inner structure of the to-be-attacked speaker verification model. The front-end feature extractor consists of convolutional and pooling layers, converting low-dimensional MFCC [sahidullah2012design] vectors into high-dimensional speaker embeddings.
  • Figure 2: Flow chart of Enrolling process (top) and Inference process (bottom).
  • Figure 3: Two methods of injecting the backdoor into the training dataset. Each batch contains N speakers with M utterances each, and different colors distinguish speakers. The left panel shows the replacing method and the right panel shows the inserting method.
  • Figure 4: The EER (%) and ASR (%) of different methods on the TIMIT and VoxCeleb1 datasets.
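
The two injection methods named in the Figure 3 caption can be sketched on a toy batch. The caption does not spell out the mechanics, so the code below encodes one plausible reading and should be treated as an assumption: "replacing" overwrites k of each speaker's M utterances with attacker utterances (batch shape stays N x M), while "inserting" appends k attacker utterances to each speaker (batch shape becomes N x (M + k)). The function names, the parameter `k`, and the string placeholders for audio are all hypothetical.

```python
import random


def poison_replace(batch, attacker_utts, k, rng):
    """Assumed 'replacing' scheme: overwrite k of each speaker's M utterances."""
    poisoned = []
    for utts in batch:
        utts = list(utts)  # copy so the clean batch is left untouched
        for idx in rng.sample(range(len(utts)), k):
            utts[idx] = rng.choice(attacker_utts)
        poisoned.append(utts)
    return poisoned


def poison_insert(batch, attacker_utts, k, rng):
    """Assumed 'inserting' scheme: append k attacker utterances per speaker."""
    return [
        list(utts) + [rng.choice(attacker_utts) for _ in range(k)]
        for utts in batch
    ]


# Toy batch: N = 3 speakers, M = 4 utterances each (strings stand in for audio).
rng = random.Random(0)
batch = [[f"spk{n}_utt{m}" for m in range(4)] for n in range(3)]
attacker = ["atk_utt0", "atk_utt1"]

replaced = poison_replace(batch, attacker, k=1, rng=rng)
inserted = poison_insert(batch, attacker, k=1, rng=rng)
```

Under this reading, replacing keeps the per-speaker utterance count fixed at the cost of discarding some clean data, whereas inserting keeps all clean utterances but enlarges each speaker's group. In either case every speaker's group then contains attacker audio, which is what would pull the attacker's embedding toward every speaker centroid under a GE2E-style loss.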