dMel: Speech Tokenization made Simple
Richard He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly
TL;DR
This work introduces dMel, a training-free tokenizer that discretizes log-mel-filterbank energies into a compact representation of ordinal intensity bins, so that a single decoder-only, LM-style transformer can perform both automatic speech recognition (ASR) and text-to-speech (TTS) within a unified RichTTS/RichASR framework. By operating directly on log-mel energies and encoding the channels of each frame in parallel, with span masking and k-frame expansion, the approach achieves competitive or superior ASR and TTS performance while remaining robust to out-of-domain audio and allowing efficient training and inference. The key contributions are (i) a simple, encoder-free speech tokenizer, (ii) an LM-style decoder capable of joint speech-text modeling, and (iii) extensive experiments on LibriSpeech and related datasets showing strong WER/CER results and natural, long-form speech generation. This unified, physics-based tokenization reduces complexity, improves generalization, and paves the way for streamlined joint speech-text modeling without heavy pretraining or multi-stage architectures.
Abstract
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel encoding and decoding method for high-dimensional tokens using an LM-style transformer architecture. This innovation enables us to develop RichTTS and RichASR, two models sharing the same architecture while achieving comparable or better results than specialized existing methods. Our results demonstrate the effectiveness of dMel in achieving high performance on both speech synthesis and recognition tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
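The discretization step described in the abstract can be illustrated with a minimal sketch. It assumes uniformly spaced bins, 16 bins per channel, and a value range taken from the input itself; none of these choices is stated in the abstract. The function names `dmel_tokenize` and `dmel_detokenize` are hypothetical, and the input here is a random array standing in for a real log-mel spectrogram of shape (frames, channels).

```python
import numpy as np

def dmel_tokenize(log_mel, num_bins=16, lo=None, hi=None):
    """Map each log-mel value to an integer intensity bin in [0, num_bins - 1].

    Assumption: bins are uniform between lo and hi. If lo/hi are not given,
    they are taken from the input, which is a choice made for this sketch.
    """
    log_mel = np.asarray(log_mel, dtype=np.float64)
    lo = float(log_mel.min()) if lo is None else float(lo)
    hi = float(log_mel.max()) if hi is None else float(hi)
    edges = np.linspace(lo, hi, num_bins + 1)
    # Using only the interior edges makes values at or beyond lo/hi
    # fall into the first or last bin instead of out-of-range indices.
    tokens = np.digitize(log_mel, edges[1:-1])
    return tokens.astype(np.int64), edges

def dmel_detokenize(tokens, edges):
    """Reconstruct approximate log-mel values from bin indices via bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[tokens]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a log-mel spectrogram: 100 frames, 80 mel channels.
    log_mel = rng.normal(size=(100, 80))
    tokens, edges = dmel_tokenize(log_mel, num_bins=16)
    recon = dmel_detokenize(tokens, edges)
    print(tokens.shape, tokens.min(), tokens.max())
    print("max reconstruction error:", np.abs(recon - log_mel).max())
```

Under these assumptions no parameters are learned, which matches the training-free property claimed above, and each frame becomes a vector of per-channel bin indices rather than a single token. For in-range values the reconstruction error is bounded by half a bin width, so the bin count trades token vocabulary size against fidelity.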
