Tracking the emergence of linguistic structure in self-supervised models learning from speech

Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema

Abstract

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
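For readers who want to see what such layerwise probing operates on, the sketch below shows one way to extract per-layer hidden states from a pretrained Wav2Vec2 model with the HuggingFace transformers library. This is a minimal illustration only: the checkpoint name and the dummy waveform are placeholders, not the paper's Dutch-trained models or data.

```python
# Minimal sketch: extracting layerwise hidden states from a pretrained
# Wav2Vec2 model for probing. The checkpoint below is an illustrative
# public model, not one of the Dutch-trained models used in the paper.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

model_name = "facebook/wav2vec2-base"  # placeholder checkpoint
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)
model.eval()

waveform = torch.randn(16000)  # stand-in for 1 second of 16 kHz audio
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000,
                           return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple with the transformer input plus one tensor per
# transformer layer, each of shape (batch, frames, hidden_dim); these
# frame-level vectors are what layerwise probes would be trained on.
for layer_idx, hidden in enumerate(outputs.hidden_states):
    print(layer_idx, hidden.shape)
```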

Paper Structure

This paper contains 22 sections, 9 figures, and 9 tables.

Figures (9)

  • Figure 1: (A) We train a set of six models on the same dataset, consisting of 831 hours of Dutch speech recordings. (B) We explore how results vary between model architectures with minimal differences in training set-up (Wav2Vec2, HuBERT-I1, HuBERT-I2). (C) We probe each model's internal representations for nine types of linguistic structure, which differ in their degrees of abstraction from the acoustic signal and in their timescales of information integration. We compare results for each structure across model layers as well as training steps.
  • Figure 2: Example outputs for two probing techniques applied to different layers of a Wav2Vec2 model at 100K training steps, with accompanying scores. Data points in the top row visualize the top 3 LDA directions for syllable type test samples, colored by syllable type. Yellow lines in the bottom row visualize model-reconstructed dependency links for one test sentence; black lines indicate the true dependency structure. (A minimal LDA probing sketch follows this figure list.)
  • Figure 3: Layerwise scores for all representational probes (rows) and three speech SSL models (columns). Grey dashed lines indicate baseline scores extracted from a Wav2Vec2 model trained on non-speech acoustic scenes. Shading indicates the std. dev. over 5 folds.
  • Figure 4: Learning trajectories for all representational probes, across training checkpoints of one Wav2Vec2 model (seed 2). Left: best-layer score for each checkpoint, with fitted sigmoid curves for the speech-trained model; grey dots indicate non-speech baseline model scores. Right: all layerwise scores across training.
  • Figure 5: Normalized parametric curves visualizing the learning trajectories for all probe scores across training checkpoints, in all six models. Stars indicate the training step where 95% of the maximum observed score across training is achieved; these steps are again marked in the rightmost plots, with different structure levels on separate rows. Different levels of linguistic structure consistently show distinct learning dynamics across S3M architectures and model seeds, with increased parallelism between levels for HuBERT-I2 models compared to HuBERT-I1 and Wav2Vec2. (A sketch of this trajectory analysis follows the figure list.)
  • ...and 4 more figures
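The kind of LDA probe visualized in Figure 2 (top row) can be sketched roughly as follows. The feature matrices, labels, and dimensionalities here are hypothetical placeholders standing in for representations extracted from one model layer; this is not the paper's actual probing code.

```python
# Hedged sketch of an LDA probe like the one visualized in Figure 2 (top row):
# fit linear discriminant directions on frame-level model representations
# pooled per syllable token, then project held-out samples onto the top 3
# discriminant axes. X_train/X_test and the syllable-type labels are
# random placeholders for representations extracted from one layer.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_types, dim = 10, 768                       # e.g. 10 syllable types, 768-d layer
X_train = rng.normal(size=(500, dim))        # pooled train representations
y_train = rng.integers(0, n_types, size=500)
X_test = rng.normal(size=(200, dim))
y_test = rng.integers(0, n_types, size=200)

lda = LinearDiscriminantAnalysis(n_components=3)
lda.fit(X_train, y_train)

projected = lda.transform(X_test)            # (200, 3): top 3 LDA directions
accuracy = lda.score(X_test, y_test)         # classification score for the probe
print(projected.shape, accuracy)
```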
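Similarly, the trajectory analysis of Figures 4 and 5 (sigmoid fits over checkpoints and the 95%-of-maximum criterion) can be approximated as in the sketch below. The checkpoint steps and scores are made-up values, and the four-parameter logistic form over log training steps is an assumption, not necessarily the paper's exact fitting procedure.

```python
# Hedged sketch of the learning-trajectory analysis in Figures 4-5: fit a
# sigmoid to best-layer probe scores across training checkpoints and find
# the first checkpoint reaching 95% of the maximum observed score. The
# checkpoint steps and scores below are placeholder values.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, lower, upper, midpoint, slope):
    """Four-parameter logistic curve over log10 training steps."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - midpoint)))

steps = np.array([1e3, 5e3, 1e4, 5e4, 1e5, 2e5, 4e5])           # training steps
scores = np.array([0.12, 0.18, 0.35, 0.62, 0.71, 0.73, 0.74])   # best-layer scores

log_steps = np.log10(steps)
params, _ = curve_fit(sigmoid, log_steps, scores,
                      p0=[scores.min(), scores.max(), log_steps.mean(), 1.0],
                      maxfev=10000)

# First checkpoint whose score reaches 95% of the maximum observed score
# (the criterion marked with stars in Figure 5).
threshold = 0.95 * scores.max()
step_95 = steps[np.argmax(scores >= threshold)]
print(params, step_95)
```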