An Empirical Analysis of Speech Self-Supervised Learning at Multiple Resolutions

Theo Clark, Benedetta Cevoli, Eloy de Jong, Timofey Abramski, Jamie Dougherty

TL;DR

This study presents an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI), and finds that the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself.

Abstract

Self-supervised learning (SSL) models have become crucial in speech processing, with recent advancements concentrating on developing architectures that capture representations across multiple timescales. The primary goal of these multi-scale architectures is to exploit the hierarchical nature of speech, where lower-resolution components aim to capture representations that align with increasingly abstract concepts (e.g., from phones to words to sentences). Although multi-scale approaches have demonstrated some improvements over single-scale models, the reasons for these improvements have little empirical support. In this study, we present an initial analysis of layer-wise representations in multi-scale architectures, with a focus on Canonical Correlation Analysis (CCA) and Mutual Information (MI). We apply this analysis to Multi-Resolution HuBERT (MR-HuBERT) and find that (1) the improved performance on SUPERB tasks is primarily due to the auxiliary low-resolution loss rather than the downsampling itself, and (2) downsampling to lower resolutions neither improves downstream performance nor correlates with higher-level information (e.g., words), though it does improve computational efficiency. These findings challenge assumptions about the multi-scale nature of MR-HuBERT and underscore the importance of disentangling computational efficiency from learning better representations.

Paper Structure

This paper contains 16 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: $\texttt{MR-HuBERT}$ framework which incorporates masked unit prediction at multiple resolutions.
  • Figure 2: Impact of auxiliary loss, downsampling and added resolutions on information content and importance in downstream performance. The word-level CCA panel shows CCA scores for $\texttt{HuBERT}$ and multiple $\texttt{MR-HuBERT}$ variants. Comparing these models, we see that the auxiliary loss is the primary factor in increasing the word-level information in earlier layers. The SUPERB weightings panel shows SUPERB weights for the ASR and SF tasks, and again shows that the auxiliary loss is responsible for middle layers being useful for downstream tasks.
  • Figure 3: Layer-wise analyses of the base $\texttt{MR-HuBERT}$ and $\texttt{HuBERT}$ models. (A) MI scores between mean-pooled word-level representations and word identities. (B) MI scores between mean-pooled phone-level representations and phone identities. (C) CCA similarity between mean-pooled phone-level representations and phone identities (one-hot encoded). (D) CCA similarity between mean-pooled word-level representations and AGWE embeddings. (E) CCA similarity between mean-pooled word-level representations and GloVe embeddings. (F) Spearman’s $\rho$ correlation between annotated human judgments and cosine similarity of spoken utterance pairs.
  • Figure 4: CCA similarity between frame-level representations and fbanks.
  • Figure 5: Layer-wise analyses of the large $\texttt{MR-HuBERT}$ and $\texttt{HuBERT}$ models. (A) MI scores between mean-pooled word-level representations and word identities (one-hot encoded). (B) MI scores between mean-pooled phone-level representations and phone identities (one-hot encoded). (C) CCA similarity between mean-pooled word-level representations and AGWE embeddings and GloVe embeddings. (D) CCA similarity between mean-pooled frame-level representations and fbanks as well as phone-level representations and phone identities (one-hot encoded). (E) CCA similarity between mean-pooled word-level representations and word identities (one-hot encoded) as well as Spearman’s $\rho$ correlation between annotated human judgments and cosine similarity of spoken utterance pairs.
  • ...and 1 more figures
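
Several of the captions above rely on two simple operations: mean-pooling frame-level features into one vector per segment, and (Figure 3F) correlating cosine similarities of utterance pairs with human judgments via Spearman's $\rho$. The following is a minimal sketch of those two steps only, not the authors' implementation. The function names (`mean_pool`, `cosine`, `average_ranks`, `spearman_rho`) and the toy frames and scores are hypothetical and chosen purely for illustration; a real analysis would use one layer's outputs from the model and an annotated similarity dataset.

```python
import math

def mean_pool(frames, start, end):
    """Average the frame vectors frames[start:end] into one segment-level vector."""
    segment = frames[start:end]
    if not segment:
        raise ValueError("empty segment")
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical): each utterance is a list of 2-dimensional frames.
utterance_pairs = [
    ([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.9, 0.1]]),  # near-identical pair
    ([[1.0, 0.0]], [[1.0, 1.0]]),                          # partially similar pair
    ([[1.0, 0.0]], [[0.0, 1.0]]),                          # orthogonal pair
]
human_scores = [4.8, 3.0, 0.5]  # hypothetical annotated similarity judgments

# Pool every frame of each utterance, then compare the pooled vectors.
model_sims = [
    cosine(mean_pool(a, 0, len(a)), mean_pool(b, 0, len(b)))
    for a, b in utterance_pairs
]
rho = spearman_rho(model_sims, human_scores)
```

Repeating this computation with features taken from each layer in turn would give one $\rho$ value per layer, which is the kind of layer-wise curve the captions describe. Word-level or phone-level pooling would pass segment boundaries as `start` and `end` instead of pooling the whole utterance.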