dMel: Speech Tokenization made Simple

Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly

TL;DR

This work introduces dMel, a training-free tokenization that discretizes log-mel-filterbank energies into a compact, ordinal-bin representation, enabling a single decoder-only LM-style transformer to perform both automatic speech recognition (ASR) and text-to-speech (TTS) within a unified RichTTS/RichASR framework. By operating directly on log-mel energies and employing parallel, per-channel encoding with span masking and k-frame expansion, the approach achieves competitive or superior ASR and TTS performance while remaining robust to out-of-domain audio and allowing efficient training and inference. The key contributions include (i) a simple encoder-free speech tokenizer, (ii) an LM-style decoder capable of joint speech-text modeling, and (iii) extensive experiments on LibriSpeech and related datasets showing strong WER/CER results and natural, long-form speech generation. This unified, physics-based tokenization reduces complexity, improves generalization, and paves the way for streamlined joint speech-text modeling without heavy pretraining or multi-stage architectures.

Abstract

Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel encoding and decoding method for high-dimensional tokens using an LM-style transformer architecture. This innovation enables us to develop RichTTS and RichASR, two models sharing the same architecture while achieving comparable or better results than specialized existing methods. Our results demonstrate the effectiveness of dMel in achieving high performance on both speech synthesis and recognition tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
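
The discretization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_bin_centers`, `dmel_encode`, `dmel_decode`), the choice of 16 evenly spaced bins, the 80 mel channels, and the value range [-7, 2] are all assumptions made for illustration, and the input is random data standing in for a real log-mel spectrogram.

```python
import numpy as np


def make_bin_centers(lo: float, hi: float, num_bins: int = 16) -> np.ndarray:
    """Evenly spaced bin centers covering [lo, hi] (assumed binning scheme)."""
    return np.linspace(lo, hi, num_bins)


def dmel_encode(log_mel: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Map every log-mel value to the index of its nearest bin center.

    log_mel has shape (frames, channels); the result has the same shape
    and holds integer bin indices, one token per channel per frame.
    """
    distances = np.abs(log_mel[..., None] - centers)
    return distances.argmin(axis=-1)


def dmel_decode(tokens: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Recover approximate log-mel values by looking up bin centers."""
    return centers[tokens]


# Stand-in data: 5 frames x 80 channels (hypothetical sizes and range).
rng = np.random.default_rng(0)
log_mel = rng.uniform(-7.0, 2.0, size=(5, 80))

centers = make_bin_centers(-7.0, 2.0, num_bins=16)
tokens = dmel_encode(log_mel, centers)
recon = dmel_decode(tokens, centers)
max_err = np.abs(recon - log_mel).max()
```

Under these assumptions no training is involved: the only parameters are the range limits and the number of bins, and the reconstruction error per value is bounded by half the spacing between adjacent bin centers.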

Paper Structure (37 sections, 6 equations, 7 figures, 14 tables)

Figures (7)

  • Figure 1: Prior work on speech tokenization either uses heavy self-supervised pretrained encoders (Baevski et al., 2020; Hsu et al., 2021) to extract semantic tokens (and trains a separate decoder for them; Lakhotia et al., 2021) or learns compression encoder-decoder models with residual vector quantization (Zeghidour et al., 2021; Défossez et al., 2022) to obtain acoustic tokens. By contrast, we eliminate the encoder and simply discretize mel-filterbanks (dMel) to encode audio, and use a simple mel-filterbank vocoder (Yamamoto et al., 2020) to reconstruct speech signals.
  • Figure 2: (Left) For a time step $t$, the encoded dMel from Figure 1 is fed to the transformer decoder to produce final embeddings for each of the frequency channels in parallel. (Right) Unified Speech-Text Transformer Decoder with speech tokens as dMel.
  • Figure 3: Unified Speech-Text Transformer Decoder with speech tokens as dMel, in which multiple frames (e.g., two) are predicted in parallel, reducing the frame rate (e.g., by 2x): dMel tokens for every two frames are stacked together to form the decoder input and are then predicted in parallel.
  • Figure 4: Speech reconstruction results on 300 random samples from the LibriSpeech test-clean set when noise is added: either background music from the MTG-Jamendo dataset (Bogdanov et al., 2019) or speech noise from test-other. WER (%) is evaluated with WhisperX ASR ("base.en"). Audio examples are available at https://apple.github.io/dmel-demo/.
  • Figure 5: ASR and TTS results (WER, %) with the dMel speech tokenizer and different numbers of bins (codebook size) used for discretization in dMel. All models are trained on LibriSpeech 960h.
  • ...and 2 more figures
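
The frame stacking described in the Figure 3 caption can be sketched as a simple reshape. This is an illustrative sketch, not the paper's code: the helper names `stack_frames` and `unstack_frames`, the zero padding used when the frame count is not a multiple of k, and the toy 5 x 4 token grid are all assumptions.

```python
import numpy as np


def stack_frames(tokens: np.ndarray, k: int = 2, pad_value: int = 0) -> np.ndarray:
    """Group every k consecutive frames into one decoder step.

    tokens has shape (frames, channels). The result has shape
    (ceil(frames / k), k * channels). Padding with pad_value is an
    assumption made here so the frame count divides evenly by k.
    """
    frames, channels = tokens.shape
    pad = (-frames) % k
    if pad:
        filler = np.full((pad, channels), pad_value, dtype=tokens.dtype)
        tokens = np.concatenate([tokens, filler], axis=0)
    return tokens.reshape(-1, k * channels)


def unstack_frames(stacked: np.ndarray, k: int, num_frames: int) -> np.ndarray:
    """Invert stack_frames and drop any padded frames."""
    channels = stacked.shape[1] // k
    return stacked.reshape(-1, channels)[:num_frames]


# Toy example: 5 frames, 4 channels, token ids below 16 (hypothetical sizes).
tokens = np.arange(5 * 4).reshape(5, 4) % 16
stacked = stack_frames(tokens, k=2)
restored = unstack_frames(stacked, k=2, num_frames=5)
```

With k = 2 the decoder sees half as many steps, each carrying the tokens of two adjacent frames, which is the 2x frame-rate reduction the caption refers to.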