Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh, Mikhail Pautov, Oleg Kiriukhin, Oleg Y. Rogov

TL;DR

The paper proposes PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs), derives a theoretical upper bound on the error probability, and validates the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.

Abstract

Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
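
To make the idea of estimating a misclassification probability under a parametric signal transformation concrete, the following is a minimal sketch and not the PV-VASM procedure itself. It assumes a hypothetical toy detector (`toy_detector`) in place of a real VASM, samples a gain perturbation uniformly from $[-10, 20]$ dB, and uses a generic one-sided Hoeffding bound as a stand-in for the bound derived in the paper. All function names and parameters are illustrative assumptions.

```python
import math
import random


def apply_gain(signal, gain_db):
    """Scale a waveform by a gain expressed in decibels."""
    factor = 10.0 ** (gain_db / 20.0)
    return [factor * s for s in signal]


def toy_detector(signal, threshold=0.5):
    """Hypothetical stand-in for a VASM (assumption, not a real model):
    returns 1 ('spoof') when the mean absolute amplitude exceeds the
    threshold, otherwise 0 ('bona fide')."""
    mean_abs = sum(abs(s) for s in signal) / len(signal)
    return 1 if mean_abs > threshold else 0


def hoeffding_upper_bound(errors, trials, alpha):
    """Generic one-sided Hoeffding bound (swapped in for illustration):
    with probability at least 1 - alpha over the sampled perturbations,
    the true error probability does not exceed the returned value."""
    empirical = errors / trials
    slack = math.sqrt(math.log(1.0 / alpha) / (2.0 * trials))
    return min(1.0, empirical + slack)


def estimate_error(signal, true_label, gain_range, trials, alpha, seed=0):
    """Monte Carlo estimate of the misclassification rate under random gain,
    together with its Hoeffding upper bound."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        gain_db = rng.uniform(*gain_range)
        if toy_detector(apply_gain(signal, gain_db)) != true_label:
            errors += 1
    return errors / trials, hoeffding_upper_bound(errors, trials, alpha)


if __name__ == "__main__":
    # A synthetic sine wave labelled 'bona fide' (0); purely illustrative data.
    wave = [0.2 * math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
    rate, bound = estimate_error(wave, 0, (-10.0, 20.0), trials=2000, alpha=1e-6)
    print(f"empirical error rate: {rate:.3f}, upper bound: {bound:.3f}")
```

The bound tightens as the number of sampled perturbations grows and loosens as $\alpha$ shrinks, which is the kind of trade-off between sampling budget and confidence level that a probabilistic verifier has to manage.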

Paper Structure (21 sections, 22 equations, 7 figures, 4 tables, 2 algorithms)

Figures (7)

  • Figure 1: Dependence of PCA on $(m, n, k)$ for background noise perturbations with $\operatorname{SNR} \in [15,30]$. The confidence level is set to $\alpha=10^{-6}$. Curves sharing the same color correspond to an identical computational budget $m$, while line styles and marker types indicate variations in $n$ and $k$, respectively.
  • Figure 2: Dependence of PCA on $\alpha$ for background noise perturbations with $\operatorname{SNR} \in [15,30]$. The values $m=6000,~n=1000,~k=6$ are fixed.
  • Figure 3: Dependence of PCA on $(m, n, k)$ for the gain adjustment transform with $\gamma \in [-10,20]~\operatorname{dB}$. The confidence level is set to $\alpha=10^{-6}$. Curves sharing the same color correspond to the same augmentation budget $m$, while line styles and marker types indicate variations in $n$ and $k$, respectively.
  • Figure 4: Dependence of PCA on $\alpha$ for the gain adjustment transform with $\gamma \in [-10,20]~\operatorname{dB}$. The values $m=20000,~n=1000,~k=20$ are fixed.
  • Figure 5: Dependence of PCA on $(m, n, k)$ for the low-pass filter whose cutoff frequency $\omega_{max}$ is randomly sampled from the range $[2500, 3000]~\operatorname{Hz}$. The confidence level is set to $\alpha=10^{-6}$. Curves sharing the same color correspond to the same augmentation budget $m$, while line styles and marker types indicate variations in $n$ and $k$, respectively.
  • ...and 2 more figures