Certification of Speaker Recognition Models to Additive Perturbations

Dmitrii Korzh; Elvir Karimov; Mikhail Pautov; Oleg Y. Rogov; Ivan Oseledets

TL;DR

This work tackles provable robustness for speaker recognition under additive perturbations by transferring randomized smoothing from the image domain to audio embeddings. It formulates a few-shot enrollment setting with embedding centroids and derives a certification guarantee for the smoothed embedding g(x) that bounds perturbations by a radius R(φ, σ). The authors implement a practical pipeline using sample-based estimates and Hoeffding bounds, achieving state-of-the-art certified accuracy on VoxCeleb1/2 and comparing favorably to prior certified methods in few-shot settings. The results demonstrate the potential of certified robustness for voice biometrics, with implications for secure access and privacy-preserving speech technologies.

Abstract

Speaker recognition technology is applied to various tasks, from personal virtual assistants to secure access systems. However, the robustness of these systems against adversarial attacks, particularly additive perturbations, remains a significant challenge. In this paper, we pioneer applying robustness certification techniques, initially developed for the image domain, to speaker recognition. Our work fills this gap by transferring randomized smoothing certification techniques against norm-bounded additive perturbations for classification and few-shot learning tasks to speaker recognition, and by improving them. We demonstrate the effectiveness of these methods on the VoxCeleb 1 and 2 datasets for several models. We expect this work to improve the robustness of voice biometrics and accelerate research on certification methods in the audio domain.

Paper Structure (22 sections, 2 theorems, 32 equations, 10 figures, 1 algorithm)

Introduction
Related Work
Speaker Recognition
Adversarial Attacks
Empirical and Certified Defenses
Methodology
Speaker Recognition as a Few-Shot Problem
Problem Statement and Certification for Vector Functions
Implementation Details
Sample Mean Instead of Expectation
Hoeffding Confidence Interval and Error Probability
Distances to the Centroids
Estimation of $\hat{\phi}$
Error Probability of Algorithm 1
Experiments
...and 7 more sections

Key Result

Theorem 1

The guarantee holds for all additive perturbations $\delta$ with $\|\delta\|_2 \le R(\phi, \sigma) = \sigma \Phi^{-1}(\phi)$, where $R(\phi,\sigma)$ is called the certified radius of $g$ at $x$.
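As a small worked example of the radius formula $R(\phi, \sigma) = \sigma \Phi^{-1}(\phi)$, the sketch below evaluates it with the inverse standard normal CDF from Python's standard library. The function name `certified_radius` and the numeric inputs are illustrative assumptions; $\phi$ is treated here as a given number in $(0, 1)$, since its definition in terms of $g$ and the centroids is not reproduced in this summary.

```python
from statistics import NormalDist


def certified_radius(phi: float, sigma: float) -> float:
    """Evaluate R(phi, sigma) = sigma * Phi^{-1}(phi).

    phi   -- a value in (0, 1); how it is obtained is outside this sketch
    sigma -- standard deviation of the Gaussian smoothing noise
    """
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    # NormalDist() is the standard normal; inv_cdf is its quantile function.
    return sigma * NormalDist().inv_cdf(phi)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show how the radius scales.
    for phi in (0.6, 0.9, 0.99):
        print(phi, certified_radius(phi, sigma=0.5))
```

The formula gives a positive radius only when $\phi > 0.5$, and the radius grows linearly in $\sigma$ for a fixed $\phi$.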

Figures (10)

Figure 1: Scheme of the proposed algorithm. The algorithm requires an audio sample $x$, a base model $f$, and the set of centroids $S^c=\{c_1, \dots, c_K\}$. In the figure, $\hat{g}(x)$ denotes the estimate of the smoothed embedding $g(x)$ from Eq. \ref{eq:rs_model_smooth}, computed in the form of Eq. \ref{eq:g_mc}. When executed, Algorithm 1 computes the confidence interval $(l_i, u_i)$ for the distance between $\hat{g}(x)$ and the corresponding centroid $c_i$ for all $i \in \{1,\dots,K\}$. Then, given the sorted confidence intervals $\{(l_{i_1}, u_{i_1}), \dots, (l_{i_K}, u_{i_K})\}$, the two closest centroids, $c_{i_1}$ and $c_{i_2}$, are determined. The last step of the algorithm computes the lower bound $R(\hat{\phi}(\hat{g},c_{i_1}, c_{i_2}))$ on the certified radius $R({\phi}(g,c_{i_1}, c_{i_2}))$ from Theorem 1.
Figure 2: Pyannote model, few-shot setting. Dependence of certified accuracy on the standard deviation $\sigma$ of the additive noise, the confidence level $\alpha$, and the maximum number of noise samples $N_{\text{max}}$.
Figure 3: Pyannote model, few-shot setting. Dependence of certified accuracy on the number $M$ of audio samples per speaker, the number of enrolled speakers $K$, and the audio length in seconds.
Figure 4: ECAPA-TDNN model, few-shot setting. Dependence of certified accuracy on the standard deviation $\sigma$ of the additive noise, the confidence level $\alpha$, and the maximum number of noise samples $N_{\text{max}}$.
Figure 5: ECAPA-TDNN model, few-shot setting. Dependence of certified accuracy on the number $M$ of audio samples per speaker, the number of enrolled speakers $K$, and the audio length in seconds.
...and 5 more figures
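
The first steps described in the Figure 1 caption (estimating the smoothed embedding by a sample mean over Gaussian noise, then finding the two closest centroids) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in for the base model $f$, all function names and numeric values are invented for the example, and `hoeffding_half_width` is the generic two-sided Hoeffding bound for a bounded sample mean. How the paper turns such bounds into the distance intervals $(l_i, u_i)$ and into $\hat{\phi}$ is not reproduced here.

```python
import math
import random


def toy_embed(x):
    """Hypothetical stand-in for the base model f: unit-norm 2-D embedding."""
    mean = sum(x) / len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    norm = math.hypot(mean, rms) or 1.0  # avoid division by zero on silence
    return [mean / norm, rms / norm]


def smoothed_embedding(f, x, sigma, n, rng):
    """Sample-mean estimate of g(x) = E[f(x + eps)], eps ~ N(0, sigma^2 I)."""
    total = None
    for _ in range(n):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        emb = f(noisy)
        total = emb if total is None else [t + e for t, e in zip(total, emb)]
    return [t / n for t in total]


def hoeffding_half_width(n, alpha, value_range):
    """Half-width t with P(|sample mean - expectation| >= t) <= alpha,
    for n i.i.d. samples confined to an interval of width value_range."""
    return value_range * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def two_closest(g_hat, centroids):
    """Indices of the two centroids nearest to g_hat, plus all distances."""
    dists = [math.dist(g_hat, c) for c in centroids]
    order = sorted(range(len(dists)), key=dists.__getitem__)
    return order[0], order[1], dists


if __name__ == "__main__":
    rng = random.Random(0)
    x = [math.sin(0.1 * t) for t in range(160)]   # toy "audio" signal
    centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # made-up centroids
    g_hat = smoothed_embedding(toy_embed, x, sigma=0.1, n=200, rng=rng)
    i1, i2, dists = two_closest(g_hat, centroids)
    print(i1, i2, hoeffding_half_width(n=200, alpha=0.05, value_range=1.0))
```

The half-width shrinks as $1/\sqrt{n}$, which is one reason a cap such as $N_{\text{max}}$ on the number of noise samples trades certification tightness against compute.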

Theorems & Definitions (5)

Theorem 1: Main result
Remark 1
Remark 2
Theorem 1: Restated
proof
