
Progressive Residual Extraction based Pre-training for Speech Representation Learning

Tianrui Wang, Jin Li, Ziyang Ma, Rui Cao, Xie Chen, Longbiao Wang, Meng Ge, Xiaobao Wang, Yuguang Wang, Jianwu Dang, Nyima Tashi

TL;DR

ProgRE, a progressive residual extraction based pre-training method, achieves significant performance improvements across speaker identification, speech recognition, emotion recognition, speech enhancement, and voice conversion, outperforming strong SSL methods such as wav2vec2.0, HuBERT, and WavLM.

Abstract

Self-supervised learning (SSL) has garnered significant attention in speech processing and excels in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each of which requires different speech information, remains challenging. To this end, we propose a progressive residual extraction based self-supervised learning method, named ProgRE. Specifically, we introduce two lightweight, specialized task modules into an encoder-style SSL backbone to enhance its ability to extract pitch variation and speaker information from speech. Furthermore, to prevent the reinforced pitch variation and speaker information from interfering with the learning of content information, for which they are irrelevant, we residually remove the information extracted by these two modules from the main branch. The main branch is then trained with HuBERT's masked speech prediction to preserve the performance of the Transformer's deep-layer features on content tasks. In this way, pitch variation, speaker, and content representations are extracted progressively from the input speech. Finally, these representations, which carry diverse speech information, are combined with task-specific layer weights to obtain a representation suited to each downstream task. Experimental results indicate that the proposed method achieves joint performance improvements on various tasks, such as speaker identification, speech recognition, emotion recognition, speech enhancement, and voice conversion, compared to strong SSL methods such as wav2vec2.0, HuBERT, and WavLM.
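
The residual removal described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the module design (`make_module`, a small tanh bottleneck), the mean-pooling used to form an utterance-level speaker vector, and all dimensions are assumptions made purely for illustration. The sketch only shows the order of operations: extract a pitch-variation representation, subtract it from the main branch, extract a speaker representation from what remains, subtract it as well, and pass the residual on to the content branch.

```python
import numpy as np

def make_module(dim, bottleneck, rng):
    """Hypothetical lightweight module: down-projection, tanh, up-projection."""
    w_down = rng.standard_normal((dim, bottleneck)) / np.sqrt(dim)
    w_up = rng.standard_normal((bottleneck, dim)) / np.sqrt(bottleneck)
    return lambda h: np.tanh(h @ w_down) @ w_up

def progressive_residual_extraction(h, pitch_module, speaker_module):
    """Extract pitch and speaker representations, removing each from the main branch.

    h: frame-level features of shape (T, D).
    Returns (o_pitch, o_speaker, residual), where residual feeds the content branch.
    """
    o_pitch = pitch_module(h)          # frame-level pitch-variation representation, (T, D)
    h = h - o_pitch                    # residual removal from the main branch
    # Assumption: speaker information is utterance-level, so pool over time.
    o_speaker = speaker_module(h).mean(axis=0, keepdims=True)   # (1, D)
    h = h - o_speaker                  # broadcast subtraction over all T frames
    return o_pitch, o_speaker, h

# Usage with random stand-in features (50 frames, 16 dimensions).
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 16))
pitch_module = make_module(16, 4, rng)
speaker_module = make_module(16, 4, rng)
o_p, o_s, content_input = progressive_residual_extraction(
    features, pitch_module, speaker_module
)
print(o_p.shape, o_s.shape, content_input.shape)
```

By construction, the three outputs sum back to the input features, which is what "residual" removal means here: whatever the two modules capture is no longer present in the features the content branch sees.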

Paper Structure

This paper contains 32 sections, 7 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Diagram of HuBERT, which takes raw waveform as input to perform a BERT-like self-supervised pre-training.
  • Figure 2: Diagram of residual vector quantization. RVQ performs progressive residual quantization of $\bm{X}$.
  • Figure 3: Diagram of the weighted-sum mechanism-based speech representation extraction. Speech is encoded into representations by a multi-layer SSL model, and then the task-specific representation for various downstream tasks is assembled with task-specific layer weights.
  • Figure 4: The diagram depicts our ProgRE model, which takes a waveform as input and progressively extracts three types of representations: pitch variation $\bm{O}^\text{p}$, speaker $\bm{O}^\text{s}$, and content $\bm{O}^\text{c}$ (indicated by black solid lines). The model is supervised by two offline systems trained on the unlabeled dataset (indicated by blue solid lines). For fine-tuning, a weighted-sum mechanism is employed (indicated by black dotted lines).
  • Figure 5: Layer-wise weight visualization in the weighted-sum mechanism of HuBERT and ProgRE. The first row shows the weights of HuBERT and the second row shows those of ProgRE. Weights are fine-tuned on the ASR and SID tasks for both the Base and Large models (left column: Base; right column: Large).
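
The two generic mechanisms named in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not code from the paper: the function names are hypothetical, the codebooks are random rather than learned, and the layer weights are normalized with a softmax, which is a common choice for weighted-sum fine-tuning but is not confirmed by the captions.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Residual vector quantization (RVQ) of x with a list of codebooks.

    x: (T, D) float array; codebooks: list of (K, D) arrays.
    Each stage quantizes what the previous stages left unexplained.
    Returns per-stage code indices, the reconstruction, and the final residual.
    """
    residual = x.copy()
    reconstruction = np.zeros_like(x)
    indices = []
    for codebook in codebooks:
        # Squared Euclidean distance from every frame to every code: (T, K)
        dist = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)          # nearest code per frame
        quantized = codebook[idx]          # (T, D)
        reconstruction += quantized
        residual -= quantized              # pass the remainder to the next stage
        indices.append(idx)
    return indices, reconstruction, residual

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def weighted_sum(layer_outputs, layer_logits):
    """Combine per-layer representations with task-specific layer weights.

    layer_outputs: (L, T, D) stack of hidden states from L layers.
    layer_logits: (L,) scalars that would be learned per downstream task.
    Returns a (T, D) task-specific representation.
    """
    weights = softmax(layer_logits)        # assumption: softmax-normalized weights
    return np.tensordot(weights, layer_outputs, axes=1)
```

With all logits equal, `weighted_sum` reduces to a plain average over layers; as one logit grows, the output approaches that single layer's representation. Plots such as Figure 5 visualize these learned per-layer weights for each task.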