NAST: Noise Aware Speech Tokenization for Speech Language Models

Shoval Messica, Yossi Adi

TL;DR

The paper addresses robust speech tokenization for Generative Spoken Language Modeling under noisy conditions. It proposes NAST, a three-component model with a frame-level predictor for local units, a residual encoder for global information, and a decoder that reconstructs HuBERT-based embeddings, coupled with a robustness objective that aligns clean and augmented inputs. The training objective combines reconstruction, diversity, and augmentation-robustness losses, promoting broad unit usage while preserving linguistic content despite perturbations. Empirically, NAST outperforms k-means baselines on UED and ABX, and shows stronger invariance to noise, reverberation, pitch-shift, and time-stretch across multiple benchmarks (sWUGGY, sBLIMP, tSC), indicating improved robustness for speech language modeling.

Abstract

Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can later be used for various downstream tasks, including automatic speech recognition and text-to-speech. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization in noisy conditions and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the effectiveness of NAST on several spoken language modeling tasks and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and robustness to signal variations in the form of noise, reverberation, pitch-shift, and time-stretch. Code and pre-trained models are available at https://github.com/ShovalMessica/NAST.

Paper Structure (11 sections, 7 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: A high-level overview of NAST. Clean and augmented signals are fed into the predictor to produce frame-wise logits. The clean logits undergo Gumbel sampling to become one-hot vectors that form the local representation. The residual encoder extracts a global representation from the clean signal, which is merged with the local representation and passed to the decoder to reconstruct the original signal embeddings. The augmented-signal logits are aligned via linear interpolation to improve robustness, and a diversity loss is applied over the one-hot vectors to encourage full unit usage.
  • Figure 2: (a) tSC performance as a function of noise level. Results are reported for both NAST and k-means using 100 clusters. (b) Speaker probing of the local representation: classifiers trained for 100 epochs on LibriSpeech 'dev-clean'.
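
The Figure 1 caption names three mechanisms: Gumbel sampling of one-hot units, alignment of augmented logits by linear interpolation, and a diversity loss over unit usage. The NumPy sketch below illustrates each in isolation. It is not the authors' implementation: the function names, the temperature `tau`, the entropy-based form of the diversity penalty, and the array shapes are all assumptions made for illustration, and the paper's exact losses may differ.

```python
import numpy as np

def gumbel_one_hot(logits, rng, tau=1.0):
    """Sample one unit per frame via the Gumbel-max trick (tau is an assumed temperature)."""
    noise = rng.gumbel(size=logits.shape)           # standard Gumbel(0, 1) noise
    idx = ((logits + noise) / tau).argmax(axis=-1)  # one sampled unit index per frame
    return np.eye(logits.shape[-1])[idx]            # (frames, units) one-hot matrix

def diversity_penalty(one_hot):
    """Assumed penalty: 1 - normalized entropy of average unit usage (0 when usage is uniform)."""
    usage = one_hot.mean(axis=0)
    entropy = -(usage * np.log(usage + 1e-9)).sum()
    return 1.0 - entropy / np.log(one_hot.shape[-1])

def align_by_interpolation(aug_logits, num_frames):
    """Linearly resample augmented logits along time to a target number of frames."""
    src = np.linspace(0.0, 1.0, aug_logits.shape[0])
    dst = np.linspace(0.0, 1.0, num_frames)
    columns = [np.interp(dst, src, aug_logits[:, k]) for k in range(aug_logits.shape[1])]
    return np.stack(columns, axis=1)

# Hypothetical shapes: 50 clean frames, 60 time-stretched frames, 100 units.
rng = np.random.default_rng(0)
clean_logits = rng.normal(size=(50, 100))
aug_logits = rng.normal(size=(60, 100))

units = gumbel_one_hot(clean_logits, rng)
aligned = align_by_interpolation(aug_logits, clean_logits.shape[0])
penalty = diversity_penalty(units)
```

After resampling, `aligned` has the same number of frames as `clean_logits`, so a frame-wise robustness term can compare the two directly. In training, the hard argmax would typically be replaced by a differentiable relaxation such as a straight-through Gumbel-softmax estimator.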