Joint Optimization of Speaker and Spoof Detectors for Spoofing-Robust Automatic Speaker Verification

Oğuzhan Kurnaz, Jagabandhu Mishra, Tomi H. Kinnunen, Cemal Hanilçi

Abstract

Spoofing-robust speaker verification (SASV) combines the tasks of speaker and spoof detection to authenticate speakers under adversarial settings. Many SASV systems fuse speaker and spoof cues at the embedding, score, or decision level, based on independently trained subsystems. In this study, we preserve the modularity of the two subsystems by integrating their outputs using trainable back-end classifiers. In particular, we explore several approaches that directly optimize the back-end using the recently proposed SASV performance metric, the architecture-agnostic detection cost function (a-DCF), as the training objective. Our experiments on the ASVspoof 5 dataset demonstrate two important findings: (i) nonlinear score fusion consistently improves a-DCF over linear fusion, and (ii) the combination of weighted cosine scoring for speaker detection with SSL-AASIST for spoof detection achieves state-of-the-art performance, reducing min a-DCF to 0.196 and SPF-EER to 7.6%. These contributions highlight the importance of modular design, calibrated integration, and task-aligned optimization for advancing robust and interpretable SASV systems.

Paper Structure

This paper contains 30 sections, 27 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Overview of the proposed modular and jointly optimized SASV framework. Embeddings, extracted from enrollment and test utterances ($\mathbf{X}_\text{enr}$ and $\mathbf{X}_\text{tst}$) using frozen ASV and CM extractors, are fed into the modular SASV system (red dashed box). Its internal modules are jointly optimized with the SASV loss $\mathcal{L}_{\mathrm{sasv}}$ (see Fig. \ref{fig:your_label} for details), producing the final SASV score $s_{\mathrm{sasv}}$. Icons in the figure mark which components are frozen and which are trainable.
  • Figure 2: Contour plot of calibrated ASV and CM scores for target, non-target, and spoof trials from simulated data. The dashed purple line shows the linear decision boundary assuming uniform priors ($\pi_{\text{tar}} = \pi_{\text{non}} = \pi_{\text{spf}} = \frac{1}{3}$), while the solid gray line shows the non-linear boundary from Eq. \ref{eq:basic-nonlinear-fusion} under the same equal-prior assumption. The blue and orange boundaries correspond to priors $(0.9, 0.05, 0.05)$ and $(0.995, 0.004, 0.001)$, respectively. The numerical values on the right-hand side indicate the error rates for each decision boundary, where "miss" denotes the rejection of target trials and "FA" (false alarm) denotes the acceptance of non-legitimate trials (non-targets and spoofs). The resulting "Dec. Cost" is computed as the sum of these two error terms.
  • Figure 3: Illustration of the three proposed modular SASV architectures. Each system comprises four components: (i) an ASV branch for extracting speaker embeddings and computing the ASV score ($s_{\mathrm{asv}}$), (ii) a CM branch for detecting spoofed speech via the CM score ($s_{\mathrm{cm}}$), (iii) a score fusion module that integrates the ASV and CM scores, and (iv) an optimization strategy that either jointly or separately tunes the system components.
  • Figure 4: DET curves comparing conventional non-linear score fusion (red) and the proposed approach (blue). The left plot shows the tradeoff between false acceptance of non-target trials ($P^{\text{non.bon}}_{\text{fa}}$) and missed detections of target trials ($P^{\text{tar.bon}}_{\text{miss}}$), corresponding to the conventional ASV performance. The right plot shows the tradeoff between false acceptance of spoof trials ($P^{\text{spf}}_{\text{fa}}$) and missed detections of target trials ($P^{\text{tar}}_{\text{miss}}$), highlighting the system’s spoofing robustness.
  • Figure 5: Comparison of non-linear score fusion (left) and the proposed approach (right) using ReDimNet as the ASV system and SSL-AASIST as the CM system. The plots show the score distributions for target, non-target, and spoof trials. Vertical lines denote the operating thresholds: $\tau^{\text{dev}}_{\text{sasv}}$ (blue dashed line) is the fixed operating point optimized on the pooled development set and used to compute the actual metrics, while $\tau^{\text{eval}}_{\text{sasv}}$ (orange dotted line) indicates the theoretically optimal threshold for the evaluation set.
  • ...and 2 more figures
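
The captions for Figures 2 and 5 describe three ideas that a short sketch may make more concrete: fusing calibrated ASV and CM scores, counting "miss" and "FA" errors at a threshold, and picking an operating threshold on held-out data. The Python code below is an illustration only, not the authors' implementation. All function names are hypothetical, and the non-linear rule shown (the log-likelihood ratio of a target trial against a prior-weighted mixture of non-target and spoof trials) is an assumed common form that may differ from the paper's own fusion equation. The FA rate is pooled over non-target and spoof trials, which is also an assumption.

```python
import numpy as np


def linear_fusion(s_asv, s_cm):
    """Linear fusion: equal-weight sum of the ASV and CM scores."""
    return np.asarray(s_asv, dtype=float) + np.asarray(s_cm, dtype=float)


def nonlinear_fusion(s_asv, s_cm, pi_non, pi_spf):
    """Assumed non-linear fusion (may differ from the paper's equation).

    Treats s_asv and s_cm as calibrated log-likelihood ratios and returns
    the LLR of 'target' against a mixture of 'non-target' and 'spoof'.
    """
    s_asv = np.asarray(s_asv, dtype=float)
    s_cm = np.asarray(s_cm, dtype=float)
    rho_spf = pi_spf / (pi_non + pi_spf)  # spoof share among negative trials
    rho_non = 1.0 - rho_spf
    # -log(rho_non * exp(-s_asv) + rho_spf * exp(-s_cm)), computed stably
    return -np.logaddexp(np.log(rho_non) - s_asv, np.log(rho_spf) - s_cm)


def decision_cost(scores, labels, threshold):
    """Return (miss, fa, miss + fa) when accepting scores >= threshold.

    'miss' is the fraction of rejected target trials; 'fa' is the fraction
    of accepted non-target and spoof trials, pooled together (assumption).
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(labels) == "target"
    miss = float(np.mean(scores[is_target] < threshold))
    fa = float(np.mean(scores[~is_target] >= threshold))
    return miss, fa, miss + fa


def min_cost_threshold(scores, labels):
    """Search the observed scores for the threshold with the lowest cost."""
    candidates = np.append(np.unique(scores), np.inf)
    costs = [decision_cost(scores, labels, t)[2] for t in candidates]
    best = int(np.argmin(costs))
    return float(candidates[best]), float(costs[best])


# Toy trials (made-up numbers, not data from the paper).
labels = ["target", "target", "nontarget", "nontarget", "spoof", "spoof"]
s_asv = [4.0, 3.0, -3.0, -4.0, 4.0, 3.0]
s_cm = [4.0, 3.0, 3.0, 4.0, -4.0, -3.0]

fused = nonlinear_fusion(s_asv, s_cm, pi_non=1 / 3, pi_spf=1 / 3)
tau_dev, cost_dev = min_cost_threshold(fused, labels)  # tuned on "dev" data
```

In this sketch, a threshold found with `min_cost_threshold` on development scores plays the role of $\tau^{\text{dev}}_{\text{sasv}}$, while running the same search on evaluation scores would give the counterpart of $\tau^{\text{eval}}_{\text{sasv}}$. The non-linear rule stays low whenever either the ASV or the CM score is low, whereas the plain sum lets a high score on one axis offset a low score on the other.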