What Do Self-Supervised Speech Models Know About Words?

Ankita Pasad, Chung-Ming Chien, Shane Settle, Karen Livescu

TL;DR

This paper investigates what word-level knowledge self-supervised speech models (S3Ms) learn during pre-training by combining lightweight projection-weighted canonical correlation analysis (PWCCA) with training-free tasks across ten diverse S3Ms, including visually grounded variants. It finds that word-identifying information concentrates near the center of word segments and that the pre-training objective and model size strongly influence where linguistic content appears across layers. Visually grounded S3Ms consistently outperform speech-only models on acoustic word discrimination, word segmentation, and semantic similarity tasks, even when using simple analysis pipelines. The work demonstrates robust cross-model trends, highlights domain effects on task performance, and provides practical guidance for selecting layers and models for word-level applications. These insights advance understanding of word-level representations in S3Ms and offer a foundation for future analyses of higher-level linguistic structures.

Abstract

Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties -- word identity, boundaries, pronunciation, syntactic features, and semantic features -- encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks -- word discrimination, word segmentation, and semantic sentence similarity -- S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.
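
Finding (i) concerns how a variable-length run of frame vectors is reduced to one word-level representation. The following is a minimal sketch of three such reductions (mean over the whole segment, a single center frame, and a mean over each segment quarter), not the authors' implementation. The function names, the half-open `[start, end)` frame indexing, and the toy two-dimensional frames are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = Sequence[float]


def mean_pool(frames: Sequence[Frame]) -> List[float]:
    """Average a non-empty run of frame vectors into a single vector."""
    n = len(frames)
    if n == 0:
        raise ValueError("cannot pool an empty run of frames")
    return [sum(column) / n for column in zip(*frames)]


def center_frame(frames: Sequence[Frame], start: int, end: int) -> List[float]:
    """Represent the segment [start, end) by the frame at its midpoint."""
    return list(frames[(start + end - 1) // 2])


def quarter_pool(frames: Sequence[Frame], start: int, end: int) -> List[List[float]]:
    """Mean-pool each of the four quarters of the segment [start, end)."""
    length = end - start
    if length < 4:
        raise ValueError("segment needs at least 4 frames to split into quarters")
    # Five boundaries delimit four contiguous, non-overlapping quarters.
    bounds = [start + (i * length) // 4 for i in range(5)]
    return [mean_pool(frames[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Toy input (hypothetical): 10 frames of dimension 2, one word spanning frames 2..9.
frames = [[float(t), 2.0 * t] for t in range(10)]
word_start, word_end = 2, 10

whole = mean_pool(frames[word_start:word_end])
middle = center_frame(frames, word_start, word_end)
quarters = quarter_pool(frames, word_start, word_end)
```

Comparing how well `whole`, `middle`, and each entry of `quarters` predict word identity is one way to probe whether frames near the center of a segment carry more word-identifying information than frames near its edges.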

Paper Structure (25 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Our word segmentation algorithm.
  • Figure 2: Evaluation of the word-identifying information in mean-pooled word segment representations from Base (left) and Large (right) S3Ms.
  • Figure 3: DTW-AWD results on LibriSpeech dev-clean.
  • Figure 4: Correlation with word identity for wav2vec2-Base when using a single frame to represent a word segment.
  • Figure 5: Correlation with word identity and AWD scores when pooling over segment quarters.
  • ...and 4 more figures