Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy

Elvir Karimov, Alexander Varlamov, Danil Ivanov, Dmitrii Korzh, Oleg Y. Rogov

TL;DR

This work addresses privacy in speaker recognition by improving universal adversarial patches (UAPs) for speaker anonymization. It introduces an Exponential TV loss to preserve imperceptibility and a tiling-based UAP generation procedure, evaluated under a length-agnostic protocol. Empirical results on VoxCeleb2 show that the proposed loss improves the trade-off between fooling rate and audio quality (higher SNR and PESQ) while maintaining robust performance across varying audio lengths, outperforming prior UAP methods. The approach moves speaker privacy closer to practical use by enabling low-distortion anonymization across diverse utterance lengths and models.

Abstract

Deep learning voice models are now widely used, but the safe handling of personal data, such as speaker identity and speech content, remains questionable. To prevent malicious user identification, speaker anonymization methods have been proposed. Current methods, particularly those based on universal adversarial patches (UAPs), have drawbacks such as significant degradation of audio quality, decreased speech recognition quality, low transferability across different voice biometrics models, and performance that depends on the input audio length. To mitigate these drawbacks, we introduce the novel Exponential Total Variance (TV) loss function and provide experimental evidence that it improves UAP strength and imperceptibility. Moreover, we present a novel scalable UAP insertion procedure and demonstrate its uniformly high performance across various audio lengths.
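As a rough illustration of what inserting a fixed-length patch into audio of arbitrary length could look like, the sketch below repeats a patch end to end, truncates it to the utterance length, adds it to the waveform, and reports the resulting SNR. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `tile_patch`, `apply_patch`, and `snr_db`, the additive insertion, and the toy signal are all hypothetical, and the Exponential TV loss itself is not reproduced because its definition does not appear in this summary.

```python
import math

def tile_patch(patch, length):
    """Repeat a fixed-length patch end to end and truncate to `length` samples."""
    if not patch:
        raise ValueError("patch must be non-empty")
    repeats = math.ceil(length / len(patch))
    return (patch * repeats)[:length]

def apply_patch(audio, patch):
    """Add the tiled patch to the waveform sample by sample (assumed additive insertion)."""
    tiled = tile_patch(patch, len(audio))
    return [a + p for a, p in zip(audio, tiled)]

def snr_db(clean, perturbed):
    """Signal-to-noise ratio in dB, treating (perturbed - clean) as the noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((p - c) ** 2 for c, p in zip(clean, perturbed))
    if noise_power == 0:
        return math.inf
    if signal_power == 0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)

# Toy data: a sine wave stands in for speech, and a 4-sample list for a learned patch.
audio = [math.sin(0.1 * n) for n in range(1000)]
patch = [0.01, -0.01, 0.02, -0.02]
anonymized = apply_patch(audio, patch)
print(f"SNR: {snr_db(audio, anonymized):.1f} dB")
```

Because the tiled patch is always cut to the utterance length, the same patch can be applied to clips shorter or longer than the patch itself, which is the property a length-agnostic evaluation would exercise.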

Paper Structure

This paper contains 8 sections, 1 theorem, 14 equations, 3 figures, 1 table.

Key Result

Theorem 1

Suppose there exists a subspace $S \subset \mathbb{R}^d$ ($\dim S = m \ll d$) such that for most $x \in \mathcal{X}$ the Hessian $H_z = \nabla^2 \mathcal{M}(z)$ of the margin at $z = x + r(x)$ satisfies: And $J_f(x)$ maps tiled perturbations to $S$: Then $\forall \beta \in (0,1)$, $\exists\hat{\delta} \in \mathbb{R}^l$ with $||\hat{\delta}|| \leq \epsilon$ such that: where $\sigma^2$ bounds the

Figures (3)

  • Figure 1: The distribution of loudness levels ($\ell_2$-norm) across the dataset.
  • Figure 2: Comparison of the proposed UAP (with $L_{\text{Exp TV}}$) approach performance with that of israel across different test audio lengths.
  • Figure 3: Cosine similarity distributions between original ("Orig") audio and Enrolment vectors ("Enroll"), between anonymized audio ("Anon") and Enrolment vectors, and between different anonymized audio of the same speakers.
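
The comparison described for Figure 3 rests on cosine similarity between speaker embeddings. The sketch below shows that computation and a threshold-based accept/reject decision of the kind commonly used in speaker verification. The embedding values, the `threshold` of 0.5, and the function names are hypothetical and chosen only for illustration; the summary does not state the models or thresholds used in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)

def is_same_speaker(test_emb, enroll_emb, threshold):
    """Accept the identity claim if similarity reaches the (assumed) threshold."""
    return cosine_similarity(test_emb, enroll_emb) >= threshold

# Hypothetical 3-dimensional embeddings; real speaker embeddings are much larger.
enroll = [0.9, 0.1, 0.4]
original = [0.8, 0.2, 0.5]     # close to the enrolment vector
anonymized = [-0.3, 0.9, 0.1]  # pushed away from the enrolment vector

print(is_same_speaker(original, enroll, threshold=0.5))
print(is_same_speaker(anonymized, enroll, threshold=0.5))
```

Under this view, anonymization succeeds when the similarity between the anonymized audio and the enrolment vector falls below the decision threshold.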

Theorems & Definitions (2)

  • Theorem 1: UAP Generalization
  • proof