
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg

TL;DR

NEST addresses the efficiency and versatility gap in speech self-supervised learning by introducing a streamlined encoder built on a FastConformer with 8x subsampling, paired with fixed random-projection quantization and generalized noisy augmentation. It replaces heavier clustering-based tokenization with a compact codebook and a simple masking objective, enabling fast pretraining on large English datasets. Across SUPERB, multilingual ASR, translation, diarization, and SLU benchmarks, NEST achieves state-of-the-art or competitive results with substantially less data and compute, and demonstrates notable cross-language transfer. The work provides practical benefits for real-world deployment and offers public code and checkpoints through NVIDIA NeMo, highlighting the method’s broad applicability and impact on speech processing tasks.

Abstract

Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition and translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition and translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
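To make the idea of fixed random-projection quantization concrete, the following is a minimal sketch rather than the NEST implementation. It assumes a common recipe that the abstract does not spell out: a frozen Gaussian projection matrix, a frozen unit-norm codebook, and nearest-neighbour lookup by cosine similarity. The function names, dimensions, and seed are all hypothetical and chosen only for illustration.

```python
import math
import random


def _unit(vec):
    """Scale a vector to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Build a random projection matrix and a unit-norm random codebook.

    Both are drawn once and never trained (assumption: Gaussian initialization).
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    codebook = [_unit([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    code_dim = len(projection[0])
    labels = []
    for frame in frames:
        # Project the frame into the codebook space, then normalize it.
        projected = _unit([
            sum(x * row[j] for x, row in zip(frame, projection))
            for j in range(code_dim)
        ])
        # For unit vectors, the smallest L2 distance is the largest dot product.
        scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
        labels.append(max(range(len(codebook)), key=scores.__getitem__))
    return labels


# Hypothetical sizes: 5 frames of 80-dimensional features, 32 codes of width 16.
demo_rng = random.Random(1)
demo_frames = [[demo_rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(5)]
demo_projection, demo_codebook = make_quantizer(feat_dim=80, code_dim=16, num_codes=32)
demo_labels = quantize(demo_frames, demo_projection, demo_codebook)
```

Because nothing in the quantizer is learned, the resulting integer labels can serve directly as prediction targets for masked frames, with no clustering pass over the training data beforehand.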

Paper Structure (15 sections, 2 figures, 5 tables)


Figures (2)

  • Figure 1: NEST serves as a bird nest that incubates a variety of speech task models.
  • Figure 2: (a) The proposed NEST framework for speech self-supervised learning. (b) Two ways to use the NEST encoder: (left) as weight initialization for tasks that require more parameters (e.g., speech recognition); (right) as a frozen feature extractor, learning a weighted summation of features from its different layers, for tasks that require fewer trainable parameters (e.g., speaker verification).
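The weighted summation described for Figure 2(b) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes one scalar logit per encoder layer, normalized with a softmax so the layer weights sum to one, and it uses plain nested lists shaped `[layer][time][dim]` in place of tensors. The names `softmax` and `weighted_layer_sum` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalar logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_features, layer_logits):
    """Combine per-layer features [layer][time][dim] into one [time][dim] sequence.

    In a real setup the encoder producing layer_features stays frozen and only
    layer_logits (plus the downstream task head) would be trained.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layer_features))
         for d in range(dim)]
        for t in range(num_frames)
    ]


# Two hypothetical layers, one frame each, two feature dimensions.
example_layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
example_output = weighted_layer_sum(example_layers, layer_logits=[0.0, 0.0])
```

With equal logits, every layer contributes equally, so the output is the plain average of the layer features. Training the logits lets a small downstream model emphasize whichever layers carry the most useful information for its task.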