Explaining Speaker and Spoof Embeddings via Probing

Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen

TL;DR

The paper addresses the explainability of spoof embeddings, asking what speaker- and spoof-related information they retain. It introduces a probing framework in which a simple MLP predicts metadata traits (classification) and acoustic traits (regression) from automatic speaker verification (ASV) and spoofing countermeasure (CM) embeddings extracted from ASVspoof 2019 LA data. The study finds that spoof embeddings discard most speaker metadata traits except gender, while retaining several spoof-related metadata traits and acoustic traits. Follow-up analysis suggests that the CM partially preserves gender so that its decisions remain robust to it. These findings can inform the design of more robust countermeasures and point toward closer integration of ASV and CM representations.

Abstract

This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.
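
The probing idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: synthetic 4-dimensional vectors stand in for ASV or CM embeddings, a binary label stands in for a metadata trait such as gender, and the probe is a tiny one-hidden-layer MLP written in plain Python. The function names (`train_probe`, `make_toy_embeddings`, `probe_accuracy`) and all hyperparameters are hypothetical choices made for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(params, x):
    # One hidden ReLU layer followed by a sigmoid output unit.
    W1, b1, W2, b2 = params
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return h, sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)


def train_probe(X, y, hidden=8, epochs=30, lr=0.05, seed=0):
    # Plain SGD on binary cross-entropy; all settings are illustrative.
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, p = forward((W1, b1, W2, b2), x)
            g = p - t  # gradient of the loss w.r.t. the output logit
            for j in range(hidden):
                # Backpropagate through ReLU before W2[j] is updated.
                gh = g * W2[j] if h[j] > 0.0 else 0.0
                W2[j] -= lr * g * h[j]
                b1[j] -= lr * gh
                for i in range(d):
                    W1[j][i] -= lr * gh * x[i]
            b2 -= lr * g
    return W1, b1, W2, b2


def accuracy(params, X, y):
    hits = sum(int((forward(params, x)[1] >= 0.5) == (t == 1))
               for x, t in zip(X, y))
    return hits / len(X)


def make_toy_embeddings(n, keep_trait, seed):
    # Toy stand-in for embeddings: dimension 0 encodes the binary trait
    # when keep_trait is True, and is pure noise otherwise.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        t = rng.randint(0, 1)
        if keep_trait:
            first = (1.0 if t else -1.0) + rng.gauss(0.0, 0.3)
        else:
            first = rng.gauss(0.0, 1.0)
        X.append([first] + [rng.gauss(0.0, 1.0) for _ in range(3)])
        y.append(t)
    return X, y


def probe_accuracy(keep_trait):
    # Train the probe on one synthetic split and score it on another.
    X_tr, y_tr = make_toy_embeddings(300, keep_trait, seed=1)
    X_te, y_te = make_toy_embeddings(200, keep_trait, seed=2)
    return accuracy(train_probe(X_tr, y_tr), X_te, y_te)


if __name__ == "__main__":
    print("trait preserved:", probe_accuracy(True))
    print("trait discarded:", probe_accuracy(False))
```

In this toy setting, a probe accuracy well above chance indicates that the trait is recoverable from the embedding, while an accuracy near chance suggests the embedding has discarded it. For acoustic traits, the same recipe would apply with a regression output and a score such as $R^{2}$ in place of accuracy.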

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: The classification accuracy of the MLP trained on ASV and CM embeddings for various speaker- and spoof-related traits available in the metadata. The brackets on the x-axis indicate the corresponding setup. The black error bars show the statistical significance of the accuracy as confidence intervals.
  • Figure 2: The $R^{2}$ values of the MLP trained on ASV and CM embeddings for acoustic speaker traits. The brackets on the x-axis indicate the corresponding setup. Values in bold italics mark results with a p-value below 0.01.
  • Figure 3: Score distribution of AASIST CM with respect to the gender and bonafide/spoofed cases.
  • Figure 4: Gender-wise distance measurement between bonafide and spoof representations, aggregated per speaker, for the encoder output and embeddings from AASIST CM.
  • Figure 5: CM performance in EER (%) with respect to different speed perturbation rates. A perturbation rate of 1.0 corresponds to the baseline system, whose CM EER is the same as in aasist2022.