Explaining Speaker and Spoof Embeddings via Probing
Xuechen Liu, Junichi Yamagishi, Md Sahidullah, Tomi Kinnunen
TL;DR
The paper tackles the explainability of spoof embeddings, asking what speaker- or spoof-related information these embeddings retain. It introduces a probing framework in which a simple MLP predicts meta traits (classification) and acoustic traits (regression) from automatic speaker verification (ASV) and spoofing countermeasure (CM) embeddings derived from ASVspoof 2019 LA data. The study finds that spoof embeddings discard most speaker meta-traits except gender, while still retaining several spoof-related meta and acoustic traits; further analysis suggests that the CM may preserve gender information so that its decisions remain robust to it. These insights have practical implications for designing more robust countermeasures and suggest avenues for closer integration between ASV and CM representations to leverage the preserved information.
Abstract
This study investigates the explainability of embedding representations, specifically those used in modern audio spoofing detection systems based on deep neural networks, known as spoof embeddings. Building on established work in speaker embedding explainability, we examine how well these spoof embeddings capture speaker-related information. We train simple neural classifiers using either speaker or spoof embeddings as input, with speaker-related attributes as target labels. These attributes are categorized into two groups: metadata-based traits (e.g., gender, age) and acoustic traits (e.g., fundamental frequency, speaking rate). Our experiments on the ASVspoof 2019 LA evaluation set demonstrate that spoof embeddings preserve several key traits, including gender, speaking rate, F0, and duration. Further analysis of gender and speaking rate indicates that the spoofing detector partially preserves these traits, potentially to ensure the decision process remains robust against them.
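The probing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings and traits are synthetic stand-ins, the one-hidden-layer NumPy MLP and its hyperparameters (hidden size, learning rate, epochs) are assumptions, and the names `train_probe`, `predict`, and `probe_score` are hypothetical. The sketch trains the same probe on embeddings that encode a trait and on control embeddings that do not, using classification for a binary meta trait and regression for a continuous acoustic trait.

```python
import numpy as np

def train_probe(X, y, task, hidden=32, epochs=1000, lr=0.05, seed=0):
    """Fit a one-hidden-layer ReLU MLP probe by full-batch gradient descent.

    task="classification": y holds integer class labels (softmax cross-entropy).
    task="regression":     y holds real values (half mean squared error).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = int(y.max()) + 1 if task == "classification" else 1
    W1 = rng.normal(0.0, np.sqrt(2.0 / d), (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, np.sqrt(1.0 / hidden), (hidden, k))
    b2 = np.zeros(k)
    T = np.eye(k)[y] if task == "classification" else y.reshape(-1, 1)
    for _ in range(epochs):
        H = np.maximum(0.0, X @ W1 + b1)                # hidden activations
        out = H @ W2 + b2
        if task == "classification":
            out = out - out.max(axis=1, keepdims=True)  # stabilise the softmax
            P = np.exp(out)
            P /= P.sum(axis=1, keepdims=True)
            G = (P - T) / n                             # gradient w.r.t. logits
        else:
            G = (out - T) / n                           # gradient w.r.t. output
        dH = (G @ W2.T) * (H > 0)                       # backprop through ReLU
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X, task):
    """Return class indices (classification) or real values (regression)."""
    W1, b1, W2, b2 = params
    out = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return out.argmax(axis=1) if task == "classification" else out[:, 0]

# Synthetic stand-ins (assumption): no real ASV or CM embeddings are used here.
rng = np.random.default_rng(0)
n_train, n_test, dim = 400, 200, 16
emb = rng.normal(size=(n_train + n_test, dim))    # embeddings that encode the traits
ctrl = rng.normal(size=(n_train + n_test, dim))   # control embeddings, unrelated to them
meta = (emb[:, 0] + 0.5 * emb[:, 1] > 0).astype(int)   # binary meta trait
acoustic = emb[:, 2] - 0.5 * emb[:, 3] + 0.1 * rng.normal(size=len(emb))  # continuous trait

def probe_score(X, y, task):
    """Train on the first n_train rows; return held-out accuracy or R^2."""
    params = train_probe(X[:n_train], y[:n_train], task)
    pred, target = predict(params, X[n_train:], task), y[n_train:]
    if task == "classification":
        return float((pred == target).mean())
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

acc_info = probe_score(emb, meta, "classification")
acc_ctrl = probe_score(ctrl, meta, "classification")
r2_info = probe_score(emb, acoustic, "regression")
r2_ctrl = probe_score(ctrl, acoustic, "regression")
print(f"meta trait accuracy: informative={acc_info:.2f}, control={acc_ctrl:.2f}")
print(f"acoustic trait R^2:  informative={r2_info:.2f}, control={r2_ctrl:.2f}")
```

A probe that scores well above the control suggests that the embedding encodes the trait; a score near chance (or an R^2 near or below zero) suggests that the trait has been discarded. With real data, `emb` would be replaced by ASV or CM embeddings and the targets by the annotated traits.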
