Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization

Yun Tang, Cindy Tseng

TL;DR

Chunk SSL addresses low-latency streaming speech processing by unifying streaming and offline pre-training in a single chunk-based self-supervised framework. It combines copy-and-append data augmentation (CADA), a CADA-compatible Conformer, and a high-resolution FSQ codebook with a memory-efficient group masked prediction loss, so that masked frames are reconstructed from unmasked context in the same and preceding chunks. The approach yields competitive streaming and offline results on LibriSpeech and MuST-C, narrows the performance gap between the two modes, and performs strongly on MuST-C speech translation while keeping latency practical. The core innovations (CADA, high-resolution FSQ with per-channel grouping, and dynamic chunk pre-training) offer a practical path to deploying a single model across streaming and offline speech tasks.

Abstract

Low-latency human-machine speech communication has become increasingly necessary as speech technology has advanced rapidly over the last decade. One of the primary drivers of that advance is self-supervised learning. Most self-supervised learning algorithms assume full utterances, and compromises have to be made when only partial utterances are available, which is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with a masked prediction loss: the acoustic encoder is encouraged to restore the indices of masked speech frames with help from unmasked frames in the same chunk and in preceding chunks. A copy-and-append data augmentation approach is proposed to make chunk-based pre-training efficient. Chunk SSL uses a finite scalar quantization (FSQ) module to discretize input speech features, and our study shows that a high-resolution FSQ codebook, i.e., a codebook with a vocabulary size of up to a few million, helps transfer knowledge from the pre-training task to downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is evaluated on two speech-to-text tasks, speech recognition and speech translation. Experimental results on the LibriSpeech and MuST-C datasets show that the proposed method achieves very competitive results for speech-to-text tasks in both streaming and offline modes.
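
To make the codebook-size argument concrete, the following is a minimal NumPy sketch of generic finite scalar quantization and a per-channel masked prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the choice of 8 channels with 7 levels each, the tanh bounding, and the function names fsq_quantize, joint_index, and group_masked_loss are all hypothetical. It shows why predicting each channel separately is cheaper than a single softmax over the joint codebook.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel of z to a fixed number of levels.

    z: real-valued array of shape (frames, channels).
    levels: number of levels per channel (odd values assumed here so that
            plain rounding gives exactly L_i distinct codes).
    Returns integer codes in [0, L_i - 1], shape (frames, channels).
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half        # each channel lies in (-half, half)
    return (np.rint(bounded) + half).astype(np.int64)

def joint_index(codes, levels):
    """Combine per-channel codes into one codebook index (mixed radix)."""
    levels = np.asarray(levels)
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return codes @ basis

def group_masked_loss(logits, codes, mask):
    """Cross-entropy averaged over masked frames, one small softmax per channel.

    logits: (frames, channels, L); codes: (frames, channels);
    mask: boolean array of shape (frames,), True where a frame is masked.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, codes[..., None], axis=-1)[..., 0]
    return -picked[mask].mean()

# Hypothetical configuration: 8 channels, 7 levels each.
levels = [7] * 8
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))      # stand-in for 5 speech frames
codes = fsq_quantize(features, levels)
mask = np.array([True, False, True, False, False])

print("joint codebook size:", int(np.prod(levels)))    # 7**8 = 5764801
print("logits per frame with per-channel heads:", sum(levels))
print("joint indices:", joint_index(codes, levels))
print("loss with uniform logits:", group_masked_loss(np.zeros((5, 8, 7)), codes, mask))
```

Under these assumptions, a single softmax would need about 5.8 million output logits per masked frame, while the per-channel formulation needs 56. That gap is the kind of memory and computation saving the group masked prediction loss is meant to provide.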

Paper Structure

This paper contains 27 sections, 5 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Chunkwise self-supervised training.
  • Figure 2: Illustration of CADA sequence computation.
  • Figure 3: Latency vs. performance.