NAST: Noise Aware Speech Tokenization for Speech Language Models
Shoval Messica, Yossi Adi
TL;DR
The paper addresses robust speech tokenization for Generative Spoken Language Modeling under noisy conditions. It proposes NAST, a three-component model with a frame-level predictor for local units, a residual encoder for global information, and a decoder to reconstruct HuBERT-based embeddings, coupled with a robustness objective that aligns clean and augmented inputs. The training objective combines reconstruction, diversity, and augmentation-robustness losses, promoting broad unit usage while preserving linguistic content despite perturbations. Empirically, NAST outperforms k-means baselines on UED and ABX, and shows stronger invariance to noise, reverberation, pitch-shifts, and time-stretch across multiple benchmarks (sWUGGY, sBLIMP, tSC), indicating meaningful improvements for robust speech language modeling.
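As an illustration of how the three loss terms might be combined, the following is a minimal pure-Python sketch. The specific forms used here are assumptions made for illustration and are not taken from the paper: mean squared error for reconstruction, normalized negative entropy of the average unit distribution for diversity, and cross-entropy between the clean input's hard units and the augmented input's unit distribution for robustness. All function names and the weights w_div and w_rob are hypothetical.

```python
import math


def reconstruction_loss(target, recon):
    # Assumed form: mean squared error over all frames and feature dimensions.
    count = sum(len(frame) for frame in target)
    squared = sum(
        (t - r) ** 2
        for t_frame, r_frame in zip(target, recon)
        for t, r in zip(t_frame, r_frame)
    )
    return squared / count


def diversity_loss(probs):
    # Assumed form: negative entropy of the average per-frame unit
    # distribution, normalized to [-1, 0]. Minimizing it pushes the model
    # toward using all units rather than collapsing onto a few.
    num_units = len(probs[0])
    avg = [sum(p[k] for p in probs) / len(probs) for k in range(num_units)]
    entropy = -sum(a * math.log(a) for a in avg if a > 0)
    return -entropy / math.log(num_units)


def robustness_loss(probs_clean, probs_aug):
    # Assumed form: cross-entropy between the hard unit chosen for the clean
    # frame and the unit distribution predicted for the augmented frame.
    total = 0.0
    for p_clean, p_aug in zip(probs_clean, probs_aug):
        unit = max(range(len(p_clean)), key=p_clean.__getitem__)
        total += -math.log(max(p_aug[unit], 1e-12))
    return total / len(probs_clean)


def total_loss(target, recon, probs_clean, probs_aug, w_div=1.0, w_rob=1.0):
    # Weighted sum of the three terms; the weights are hypothetical.
    return (
        reconstruction_loss(target, recon)
        + w_div * diversity_loss(probs_clean)
        + w_rob * robustness_loss(probs_clean, probs_aug)
    )
```

In this sketch the robustness term vanishes when the augmented input is assigned the same units as the clean input with full confidence, and the diversity term reaches its minimum of -1 when unit usage is uniform.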
Abstract
Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can later be used for various downstream tasks, including automatic speech recognition and text-to-speech. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization in noisy conditions and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the effectiveness of NAST on several spoken language modeling tasks and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and its robustness to signal variations in the form of noise, reverberation, pitch-shift, and time-stretch. Code and pre-trained models are available at https://github.com/ShovalMessica/NAST.
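One way to quantify robustness to signal variations is to compare the unit sequence produced for a clean utterance with the one produced for its augmented version. The sketch below computes a unit edit distance of this kind in pure Python. The details are assumptions for illustration rather than the paper's exact protocol: consecutive repeated units are collapsed before comparison, and the Levenshtein distance is normalized by the number of clean frames. All function names are hypothetical.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def dedup(units):
    # Collapse runs of identical consecutive units, e.g. [1, 1, 2] -> [1, 2].
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def unit_edit_distance(clean_units, aug_units):
    # Assumed normalization: divide by the number of clean frames.
    if not clean_units:
        raise ValueError("clean_units must be non-empty")
    return levenshtein(dedup(clean_units), dedup(aug_units)) / len(clean_units)


# Toy unit sequences for a clean signal and a perturbed copy of it.
clean = [1, 1, 2, 3]
augmented = [1, 2, 2, 4]
print(unit_edit_distance(clean, augmented))  # 0.25
```

A lower value indicates that the tokenizer assigns more similar units to the clean and perturbed signals; a value of 0 means the deduplicated sequences are identical.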
