Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

TL;DR

The paper revisits SASV score fusion through decision-theoretic and compositional data analysis lenses. It shows that score calibration before fusion is beneficial, advocates fusing LLRs rather than raw scores, and demonstrates that a nonlinear LLR fusion rule yields superior SASV discrimination compared to linear methods, with strong results on the SASV challenge data. It also links Gaussian back-end fusion to the optimal decision formulation under certain priors and costs, offering practical guidance for designing robust spoofing-aware speaker verification systems. Overall, the work provides a principled framework for SASV fusion that improves robustness to zero-effort imposters and spoofing attacks while offering actionable calibration and fusion strategies.

Abstract

Fusing outputs from automatic speaker verification (ASV) and spoofing countermeasure (CM) is expected to make an integrated system robust to zero-effort imposters and synthesized spoofing attacks. Many score-level fusion methods have been proposed, but many remain heuristic. This paper revisits score-level fusion using tools from decision theory and presents three main findings. First, fusion by summing the ASV and CM scores can be interpreted on the basis of compositional data analysis, and score calibration before fusion is essential. Second, the interpretation leads to an improved fusion method that linearly combines the log-likelihood ratios of ASV and CM. However, as the third finding reveals, this linear combination is inferior to a non-linear one in making optimal decisions. The outcomes of these findings, namely, the score calibration before fusion, improved linear fusion, and better non-linear fusion, were found to be effective on the SASV challenge database.
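The contrast between linear and non-linear fusion can be illustrated with a minimal sketch. It assumes both subsystem outputs are already calibrated LLRs: `llr_asv` for target versus non-target bona fide speech and `llr_cm` for bona fide versus spoofed speech. The function names and the weight `rho` (the relative prior of non-target against spoof) are illustrative assumptions, not the paper's notation. The non-linear rule shown is the LLR of the target class against a prior-weighted mixture of the two negative classes, $-\log\left(\rho e^{-\mathrm{llr}_{\mathrm{asv}}} + (1-\rho) e^{-\mathrm{llr}_{\mathrm{cm}}}\right)$, and may differ in detail from the rule derived in the paper.

```python
import math


def linear_fusion(llr_asv: float, llr_cm: float) -> float:
    """Linear fusion: plain sum of the two calibrated LLRs."""
    return llr_asv + llr_cm


def nonlinear_fusion(llr_asv: float, llr_cm: float, rho: float = 0.5) -> float:
    """Non-linear fusion (illustrative assumption, see lead-in).

    Computes -log(rho * exp(-llr_asv) + (1 - rho) * exp(-llr_cm)),
    where rho in (0, 1) is the assumed relative prior of the non-target
    class against the spoof class. Uses the log-sum-exp trick for
    numerical stability.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    a = math.log(rho) - llr_asv
    b = math.log1p(-rho) - llr_cm
    m = max(a, b)
    return -(m + math.log(math.exp(a - m) + math.exp(b - m)))


if __name__ == "__main__":
    # ASV strongly accepts the speaker, CM strongly flags a spoof.
    llr_asv, llr_cm = 10.0, -10.0
    print("linear    :", linear_fusion(llr_asv, llr_cm))
    print("non-linear:", nonlinear_fusion(llr_asv, llr_cm))
```

In this example the linear sum lets the confident ASV score cancel the CM evidence, whereas the non-linear rule stays close to the weaker of the two LLRs, so a trial that either subsystem strongly rejects remains rejected.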

Paper Structure (24 sections, 48 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Example SASV with score-level fusion (B1 in [jung22c_interspeech])
  • Figure 2: Bifurcating tree for ternary hypothesis test based on compositional data analysis [noe:tel-04264175]
  • Figure 3: Scatter plot of ASV $\mathsf{llr}^{\mathsf{tar.bf}}_{\mathsf{non.bf}}(\boldsymbol{s})$ and CM $\mathsf{llr}^{\mathsf{tar.bf}}_{\mathsf{spf}}(\boldsymbol{s})$ from simulated data. Dashed and solid lines are decision boundaries based on Ineqs. (\ref{eq:action_fused_lr}) and (\ref{eq:action_optimal_llr}), respectively, given the true flat prior. The green solid line is based on Ineq. (\ref{eq:action_optimal_llr}) but with mismatched priors $\pi_{\mathsf{spf}}=0.05$, $\pi_{\mathsf{non.bf}}=0.05$, $\pi_{\mathsf{tar.bf}}=0.9$. Note that LLRs rather than raw scores are plotted.
  • Figure 4: Distributions of CM, ASV, and fused SASV scores. Bona fide data of target speakers ($\mathsf{tar.bf}$), bona fide data of non-target speakers ($\mathsf{non.bf}$), and spoofed data ($\mathsf{spf}$) are shown in different colors. Each vertical line in the bottom plane marks the SASV-EER threshold.