Sparse Autoencoder Insights on Voice Embeddings

Daniel Pluth, Yu Zhou, Vijay K. Gurbani

TL;DR

This paper tackles explainability for audio embeddings by applying sparse autoencoders to Titanet speaker embeddings. It trains SAEs to expand to a sparse latent space of dimension $L$ and identifies latent indices corresponding to mono-semantic attributes, such as language or IVR music, and demonstrates steering by modifying latent activations. Key contributions include showing mono-semantic features in non-textual embeddings, evidence of feature splitting and steering analogous to LLM studies, and evaluation on telephony Titanet data with sub-1% EER and high language/music discrimination. Limitations include a relatively small latent-to-embedding ratio, potential dead latents in some configurations, and data-domain biases; future work explores universality across embedding models and application to Whisper embeddings.

Abstract

Recent advances in explainable machine learning have highlighted the potential of sparse autoencoders in uncovering mono-semantic features in densely encoded embeddings. While most research has focused on Large Language Model (LLM) embeddings, the applicability of this technique to other domains remains largely unexplored. This study applies sparse autoencoders to speaker embeddings generated from a Titanet model, demonstrating the effectiveness of this technique in extracting mono-semantic features from non-textual embedded data. The results show that the extracted features exhibit characteristics similar to those found in LLM embeddings, including feature splitting and steering. The analysis reveals that the autoencoder can identify and manipulate features such as language and music, which are not evident in the original embedding. The findings suggest that sparse autoencoders can be a valuable tool for understanding and interpreting embedded data in many domains, including audio-based speaker recognition.
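As a rough illustration of the mechanics summarized above, the following sketch shows a TopK sparse autoencoder forward pass and a steering step on a single embedding. It is not the authors' implementation: the weights are random rather than trained, the dimensions (a 192-dimensional embedding, $L = 1024$ latents, $k = 16$ active latents) are assumptions chosen for illustration, and the function names `topk_sae_forward` and `steer` are hypothetical. The symbols follow the Figure 1 caption: embedding $e$, latent vector $v$, reconstruction $\epsilon$.

```python
import numpy as np

def topk_sae_forward(e, W_enc, b_enc, W_dec, b_dec, k):
    """Encode embedding e into a sparse latent v (TopK), then reconstruct it."""
    pre = W_enc @ (e - b_dec) + b_enc        # pre-activations, shape (L,)
    v = np.zeros_like(pre)
    top = np.argpartition(pre, -k)[-k:]      # indices of the k largest pre-activations
    v[top] = np.maximum(pre[top], 0.0)       # keep only those k, clipped at zero
    eps = W_dec @ v + b_dec                  # reconstruction, shape (d,)
    return v, eps

def steer(v, index, value, W_dec, b_dec):
    """Overwrite one latent activation and decode the modified latent vector."""
    v_steered = v.copy()
    v_steered[index] = value
    return W_dec @ v_steered + b_dec

# Illustrative sizes only (assumptions, not values taken from the paper).
d, L, k = 192, 1024, 16
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=d ** -0.5, size=(L, d))  # untrained stand-in weights
W_dec = W_enc.T.copy()                            # tied initialization, an assumption
b_enc = np.zeros(L)
b_dec = np.zeros(d)

e = rng.normal(size=d)                            # stand-in for a speaker embedding
v, eps = topk_sae_forward(e, W_enc, b_enc, W_dec, b_dec, k)

# Steering: switch off the most active latent and decode again.
strongest = int(np.argmax(v))
eps_steered = steer(v, strongest, 0.0, W_dec, b_dec)
```

In a trained model, the index passed to `steer` would be a latent identified as carrying a specific attribute (for example, a language), and the steered reconstruction would then be compared against the original embedding.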

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Embedding $e$ is reconstructed as $\epsilon$ via latent vector $v$.
  • Figure 2: Performance of the top latent index for classifying language across models with varying latent dimension and TopK activation.
  • Figure 3: Performance of the top latent index for classifying music across models with varying latent dimension and TopK activation.
  • Figure 4: The movement of the different language and gender samples in and out of the predominant Spanish language index.
  • Figure 5: Distribution of relative similarity scores before and after Spanish feature steering.
  • ...and 1 more figure