Table of Contents
Fetching ...

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen

TL;DR

It is found that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms, and the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.

Abstract

Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual information based on data availability. This has shown to enhance the quality of learned representations, especially at encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and what is the optimal training strategy for cross-modal semantic retrieval and phoneme discrimination performance. As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms. However, the parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

TL;DR

It is found that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms, and the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.

Abstract

Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms for representation learning. The joint training with SSL and VGS mechanisms provides the opportunity to utilize both unlabeled speech and speech-related visual information based on data availability. This has shown to enhance the quality of learned representations, especially at encoding semantic- and lexical-level knowledge. In this work, we further study the joint optimization of wav2vec 2.0-based SSL and transformer-based VGS as a multi-task learning system. We explore a set of training scenarios to understand how speech representations are shared or transferred between the two tasks, and what is the optimal training strategy for cross-modal semantic retrieval and phoneme discrimination performance. As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms. However, the parallel SSL-VGS training reduces the effects of catastrophic forgetting when switching between optimization criteria. Moreover, the results suggest that phonemic representations learned through the VGS mechanism may generalize better across datasets compared to those learned with SSL.
Paper Structure (12 sections, 1 equation, 3 figures, 2 tables)

This paper contains 12 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: VGS+ as a joint model of VGS and SSL training. The green block is optimized for the both tasks.
  • Figure 2: ABX error (W-C) for the tested training variants. Left: as a function of speech encoder (no. 1--8) and decoder (9--12) layer after the full 70 training epochs. Right: best layer score across the training epochs. Note the logarithmic scale of the y-axes.
  • Figure 3: Training loss curves for all tested model variants (log-scale): Left: $loss_{AV}$. Right: $loss_{AUD}$. The graphs show full training (70 epochs) for the base models (solid lines) and the combinations of 20 epochs of pretraining and 50 epochs of main training for the training schedule variants.