Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

TL;DR

This work tackles unsupervised speech representation learning by introducing PASE, a problem-agnostic speech encoder paired with seven cooperative self-supervised workers that solve diverse tasks. The jointly trained encoder learns robust, transferable embeddings that capture speaker identity, phonemes, and emotional cues, and can be used directly or fine-tuned for downstream tasks such as speaker identification, emotion recognition, and ASR. Empirical results show that PASE outperforms traditional features (MFCC/FBANK), especially when fine-tuned, and transfers well to noisy and reverberant conditions such as DIRHA, which suggests it can serve as a general-purpose speech feature extractor. The approach highlights the benefit of consensus across multiple self-supervised tasks in avoiding superficial representations, and the encoder is designed to be easily exportable and extensible to additional tasks.

Abstract

Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method, where a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The needed consensus across different tasks naturally imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct usage or adaptation to different problems.
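
The idea of one encoder feeding several workers can be illustrated with a toy sketch. This is not the PASE implementation: the linear encoder, the two regression workers, their placeholder names, the layer sizes, and the unweighted sum of losses are all assumptions made for illustration, since this page does not list the actual tasks or how their losses are combined. The point it shows is only that every worker's loss depends on the same shared representation, so the encoder has to satisfy all of them at once.

```python
import random

def make_linear(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    """Multiply the weight matrix by the input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(encoder, workers, x, targets):
    """Encode x once, then let every worker predict its own target from it."""
    z = apply_linear(encoder, x)  # shared representation used by all workers
    per_worker = {
        name: mse(apply_linear(head, z), targets[name])
        for name, head in workers.items()
    }
    # Assumption: losses are combined by a plain unweighted sum.
    return sum(per_worker.values()), per_worker

rng = random.Random(0)
frame = [rng.uniform(-1, 1) for _ in range(16)]  # stand-in for a speech frame

encoder = make_linear(16, 8, rng)
# Hypothetical worker names; the real tasks are not listed in this section.
workers = {
    "worker_a": make_linear(8, 16, rng),  # e.g. reconstruct the input frame
    "worker_b": make_linear(8, 4, rng),   # e.g. predict a small feature vector
}
targets = {"worker_a": frame, "worker_b": [0.0] * 4}

loss, per_worker = total_loss(encoder, workers, frame, targets)
print(loss, per_worker)
```

In a real training loop the gradient of the combined loss would flow back through every worker into the single encoder, which is where the consensus constraint described in the abstract comes from.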

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The PASE architecture, with the considered workers.