Sparse Autoencoder Insights on Voice Embeddings
Daniel Pluth, Yu Zhou, Vijay K. Gurbani
TL;DR
This paper tackles explainability for audio embeddings by applying sparse autoencoders (SAEs) to Titanet speaker embeddings. The SAEs expand each embedding into a sparse latent space of dimension $L$; the authors identify latent indices that correspond to mono-semantic attributes, such as language or IVR music, and demonstrate steering by modifying latent activations. Key contributions include showing mono-semantic features in non-textual embeddings, evidence of feature splitting and steering analogous to LLM studies, and evaluation on telephony Titanet data with sub-1% EER and high language/music discrimination. Limitations include a relatively small latent-to-embedding ratio, potential dead latents in some configurations, and data-domain biases; future work is proposed on universality across embedding models and on application to Whisper embeddings.
Abstract
Recent advances in explainable machine learning have highlighted the potential of sparse autoencoders in uncovering mono-semantic features in densely encoded embeddings. While most research has focused on Large Language Model (LLM) embeddings, the applicability of this technique to other domains remains largely unexplored. This study applies sparse autoencoders to speaker embeddings generated from a Titanet model, demonstrating the effectiveness of this technique in extracting mono-semantic features from non-textual embedded data. The results show that the extracted features exhibit characteristics similar to those found in LLM embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding. The findings suggest that sparse autoencoders can be a valuable tool for understanding and interpreting embedded data in many domains, including audio-based speaker recognition.
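To make the mechanics concrete, the following is a minimal sketch of the kind of sparse autoencoder described above: an encoder that expands an embedding into a larger non-negative latent vector, a linear decoder that reconstructs the embedding, and a steering step that overwrites one latent before decoding. It is an illustration under stated assumptions, not the authors' implementation. The embedding size `D = 192`, the latent size `L = 1024`, the ReLU encoder, the latent index `7`, and the function names `encode`, `decode`, and `steer` are all hypothetical choices. The weights are random stand-ins for trained parameters, so the latents here are not actually sparse or mono-semantic; in practice sparsity would come from training with a sparsity constraint, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 192    # embedding dimension (assumed for illustration)
L = 1024   # SAE latent dimension (hypothetical value)

# Random weights stand in for a trained SAE; a real model would learn these.
W_enc = rng.normal(scale=0.1, size=(D, L))
b_enc = np.zeros(L)
W_dec = rng.normal(scale=0.1, size=(L, D))
b_dec = np.zeros(D)


def encode(x):
    """Map an embedding to non-negative latent activations (ReLU encoder)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(z):
    """Reconstruct an embedding from latent activations (linear decoder)."""
    return z @ W_dec + b_dec


def steer(x, latent_index, value):
    """Encode x, overwrite one latent activation, and decode the result."""
    z = encode(x)
    z[..., latent_index] = value
    return decode(z)


x = rng.normal(size=D)                 # placeholder for a speaker embedding
z = encode(x)                          # latent activations, shape (L,)
x_hat = decode(z)                      # reconstruction, shape (D,)
x_steered = steer(x, latent_index=7, value=5.0)
```

Because the decoder is linear, steering a single latent shifts the reconstruction along that latent's decoder row: `x_steered - x_hat` equals `(5.0 - z[7]) * W_dec[7]`. If a trained latent tracked an attribute such as language, this is the direction in embedding space along which that attribute would be adjusted.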
