MelTok: 2D Tokenization for Single-Codebook Audio Compression
Jingyi Li, Zhiyuan Zhao, Zhisheng Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li
TL;DR
MelTok introduces a 2D mel-spectrogram tokenizer that compresses 44.1 kHz audio into a single codebook and pairs it with a token-based vocoder in a two-stage pipeline. The first stage learns discrete mel-tokens by reconstructing log-Mel spectrograms with both reconstruction and perceptual losses, while the second stage reconstructs waveforms from these tokens using a Lipschitz-stable vocoder and adversarial training. The approach delivers competitive perceptual quality and superior spectral fidelity compared with multi-codebook baselines, and it retains discriminative information for downstream tasks. The work also provides a theoretical bounded-error analysis of how errors propagate from the discrete codes to the final waveform, which informs the architecture choices and suggests strong potential for audio-language modeling and downstream understanding tasks.
Abstract
Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi-layer residual vector quantizers to a single-layer quantizer has been shown to make decoding in downstream language models more efficient. However, the ability of a single codebook to capture fine-grained acoustic details remains limited, as the frequency-variant nature of 1D tokenizers leads to redundancy. To address this issue, we propose MelTok, a two-dimensional (2D) tokenizer that effectively compresses the acoustic details of 44.1 kHz audio into a single codebook. The tokenizer encodes audio into a more compact representation than one-dimensional tokenizers. Furthermore, to recover audio from mel-spectrogram tokens, we propose a token-based vocoder. Both objective and subjective evaluations demonstrate that MelTok achieves quality comparable to multi-codebook codecs and outperforms existing state-of-the-art single-codebook neural codecs on high-fidelity audio reconstruction. By preserving acoustic details, MelTok offers a strong representation for downstream understanding tasks.
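
To make the idea of 2D tokenization with a single codebook concrete, the following is a minimal sketch and not the MelTok implementation. It assumes a toy setup in which raw time-frequency patches of a log-Mel spectrogram are matched to their nearest codebook entry; the actual tokenizer learns its encoder and codebook, and the names `patchify`, `quantize`, and `detokenize`, the patch size, and the codebook size are all hypothetical.

```python
import random


def patchify(spec, ph, pw):
    """Split a 2D spectrogram (n_mels rows x n_frames columns) into
    flattened ph x pw patches, scanning the patch grid row by row."""
    n_mels, n_frames = len(spec), len(spec[0])
    assert n_mels % ph == 0 and n_frames % pw == 0
    patches = []
    for r in range(0, n_mels, ph):
        for c in range(0, n_frames, pw):
            patches.append([spec[r + i][c + j]
                            for i in range(ph) for j in range(pw)])
    return patches


def quantize(patches, codebook):
    """Nearest-neighbour lookup in ONE codebook: one token id per patch."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(p, codebook[k]))
            for p in patches]


def detokenize(tokens, codebook, n_mels, n_frames, ph, pw):
    """Place each token's codebook patch back at its grid position."""
    spec = [[0.0] * n_frames for _ in range(n_mels)]
    grid_w = n_frames // pw
    for idx, t in enumerate(tokens):
        r0, c0 = (idx // grid_w) * ph, (idx % grid_w) * pw
        for i in range(ph):
            for j in range(pw):
                spec[r0 + i][c0 + j] = codebook[t][i * pw + j]
    return spec


# Toy data: random values stand in for a log-Mel spectrogram and for a
# learned codebook (both are assumptions made purely for illustration).
rng = random.Random(0)
n_mels, n_frames, ph, pw = 8, 12, 4, 4
log_mel = [[rng.gauss(0, 1) for _ in range(n_frames)] for _ in range(n_mels)]
codebook = [[rng.gauss(0, 1) for _ in range(ph * pw)] for _ in range(16)]

tokens = quantize(patchify(log_mel, ph, pw), codebook)
recon = detokenize(tokens, codebook, n_mels, n_frames, ph, pw)
```

In this sketch each token covers a block that spans several mel bins and several frames, so the token count is `(n_mels // ph) * (n_frames // pw)` rather than one or more tokens per frame. In the pipeline described above, a token-based vocoder would then map such tokens to a waveform; that stage is not sketched here.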
