What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

TL;DR

The paper investigates what information is encoded by speech SSL, speaker SSL, and fully supervised speaker models using SUPERB-style probing and a cross-model, layer-wise similarity framework. It trains ECAPA-TDNN-based speaker models and a DINO-based speaker SSL variant, and evaluates them alongside HuBERT and WavLM speech SSL models on a suite of probing tasks, employing a weighted layer-sum predictor and LinCKA to map information distribution across layers. The results show that non-speaker information such as content and semantics is not strongly correlated with speaker-representation quality; speech SSL layers progressively capture linguistic content in deeper layers, while speaker SSL and supervised speaker models disentangle phoneme information early but largely underrepresent linguistic cues in later layers. These findings suggest that speaker-focused objectives can yield robust speaker embeddings without requiring deep linguistic representations, while speech SSL models develop richer linguistic structure in their deeper layers, guiding more efficient architectural and training designs for speech-related tasks. Overall, the work clarifies how different SSL and supervised strategies distribute information across layers, informing when and how to leverage layer-wise representations for downstream tasks and model efficiency.
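To make the weighted layer-sum idea concrete, here is a minimal NumPy sketch. It assumes the common SUPERB-style setup in which one scalar per layer is passed through a softmax and used to mix the per-layer features before a downstream head; the function names, array shapes, and random inputs are illustrative and not taken from the paper.

```python
import numpy as np


def softmax(raw_weights):
    """Numerically stable softmax over a 1-D vector of layer scores."""
    w = np.asarray(raw_weights, dtype=np.float64)
    e = np.exp(w - w.max())
    return e / e.sum()


def weighted_layer_sum(layer_feats, raw_weights):
    """Mix per-layer features with softmax-normalized scalar weights.

    layer_feats: array of shape (num_layers, num_frames, dim)
    raw_weights: array of shape (num_layers,), the (normally learned) scores
    Returns the mixed features (num_frames, dim) and the normalized weights.
    """
    feats = np.asarray(layer_feats, dtype=np.float64)
    weights = softmax(raw_weights)
    # Contract the layer axis: sum_l weights[l] * feats[l]
    mixed = np.tensordot(weights, feats, axes=(0, 0))
    return mixed, weights


# Hypothetical input: 4 layers, 10 frames, 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 10, 8))

# With all-zero scores the softmax is uniform, so the mix is the layer mean.
pooled, weights = weighted_layer_sum(feats, np.zeros(4))
```

After training on a probing task, the normalized weights can be read as a rough indication of which layers the task relies on, which is the kind of quantity a layer-weight visualization reports.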

Abstract

Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these models represent information is essential for refining model efficiency and effectiveness. Unlike the various analyses of speech SSL, there has been limited investigation into what information speaker SSL captures and how its representation differs from speech SSL or other fully-supervised speaker models. This paper addresses these fundamental questions. We explore the capacity to capture various speech properties by applying SUPERB evaluation probing tasks to speech and speaker SSL models. We also examine which layers are predominantly utilized for each task to identify differences in how speech is represented. Furthermore, we conduct direct comparisons to measure the similarities between layers within and across models. Our analysis unveils that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models would be partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation.
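The layer-to-layer comparisons can be illustrated with linear CKA (the LinCKA named in the TL;DR). The sketch below uses the standard linear CKA formula on two feature matrices extracted from the same inputs; it is an assumption that this matches the paper's exact variant, and the function name and random test matrices are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples.

    x: array of shape (n, d1), y: array of shape (n, d2).
    Returns a similarity in [0, 1]; 1 means identical up to an
    orthogonal transform and isotropic scaling.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Center each feature dimension over the examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical features: 200 examples from two "layers" of different width.
rng = np.random.default_rng(0)
a = rng.standard_normal((200, 16))
b = rng.standard_normal((200, 32))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal matrix

same = linear_cka(a, a)               # identical representations
rotated = linear_cka(a, 3.0 * a @ q)  # rotated and rescaled copy
unrelated = linear_cka(a, b)          # independent random features
```

Because the two inputs may have different feature dimensions, the same function can compare layers within one model or across models of different sizes, provided both are evaluated on the same set of examples.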

Paper Structure

This paper contains 13 sections, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Schematic diagrams of benchmarking (A) ECAPA-TDNN and (B) conventional speech SSL upstream models with several self-attention blocks on SUPERB.
  • Figure 2: Visualization results of weights of weighted sum for (A) WavLM Large, (B) DINO, and (C) supervised speaker model.
  • Figure 3: Similarity results of each layer comparing identical WavLM models with variation of (A) Base, (B) Base+, and (C) Large.
  • Figure 4: Similarity results of each layer comparing identical (A) DINO and (B) supervised speaker model.
  • Figure 5: Similarity results of each layer comparing (A) DINO with supervised speaker model, (B) DINO with WavLM Large, and (C) supervised speaker model with WavLM Large.
  • ...and 1 more figure