MelTok: 2D Tokenization for Single-Codebook Audio Compression

Jingyi Li, Zhiyuan Zhao, Zhisheng Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Jiahao Wu, Qiuqiang Kong, Yu Li

TL;DR

MelTok introduces a 2D mel-spectrogram tokenizer that compresses 44.1 kHz audio into a single codebook and pairs it with a token-based vocoder in a two-stage pipeline. The first stage learns discrete mel-tokens by reconstructing log-Mel spectrograms with both reconstruction and perceptual losses; the second stage reconstructs waveforms from these tokens using a Lipschitz-stable vocoder and adversarial training. The approach delivers competitive perceptual quality and superior spectral fidelity compared with multi-codebook baselines, and it retains discriminative information for downstream tasks. The work also provides a theoretical bounded-error analysis of how error propagates from the discrete codes to the final waveform, which informs architecture choices and suggests strong potential for audio-language modeling and downstream understanding tasks.

Abstract

Large Audio Language Models (LALMs) have emerged with strong performance across diverse audio understanding tasks and can be further enhanced by neural audio codecs. Transitioning from multi-layer residual vector quantizers to a single-layer quantizer has been shown to make downstream language-model decoding more efficient. However, the ability of a single codebook to capture fine-grained acoustic details remains limited, as the frequency-variant nature of 1D tokenizers leads to redundancy. To address this issue, we propose MelTok, a two-dimensional (2D) tokenizer that effectively compresses the acoustic details of 44.1 kHz audio into a single codebook. The tokenizer encodes audio into a more compact representation than one-dimensional tokenizers. Furthermore, to recover audio from mel-spectrogram tokens, we propose a token-based vocoder. Both objective and subjective evaluations demonstrate that MelTok achieves quality comparable to multi-codebook codecs and outperforms existing state-of-the-art single-codebook neural codecs on high-fidelity audio reconstruction. By preserving acoustic details, MelTok offers a strong representation for downstream understanding tasks.
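
To make the idea of 2D tokenization with a single codebook concrete, the following is a minimal NumPy sketch, not MelTok's actual architecture. It assumes a log-Mel spectrogram is cut into fixed frequency-by-time patches and that each patch is mapped to its nearest vector in one codebook. The patch size (16 x 4), the codebook size (256), the random untrained codebook, and the function names `patchify`, `quantize`, and `dequantize` are all hypothetical choices for illustration; the paper's tokenizer is a learned model.

```python
import numpy as np

def patchify(log_mel, patch_f, patch_t):
    """Split a (n_mels, n_frames) array into flattened patch_f x patch_t patches."""
    n_mels, n_frames = log_mel.shape
    # Simplification: drop trailing bins/frames that do not fill a whole patch.
    n_f, n_t = n_mels // patch_f, n_frames // patch_t
    x = log_mel[: n_f * patch_f, : n_t * patch_t]
    # (n_f, patch_f, n_t, patch_t) -> (n_f, n_t, patch_f, patch_t)
    x = x.reshape(n_f, patch_f, n_t, patch_t).transpose(0, 2, 1, 3)
    return x.reshape(n_f * n_t, patch_f * patch_t), (n_f, n_t)

def quantize(patches, codebook):
    """Nearest-neighbour lookup in a single codebook: one integer id per patch."""
    # Squared Euclidean distance between every patch and every code vector.
    dist = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def dequantize(ids, codebook, grid, patch_f, patch_t):
    """Rebuild a (n_mels, n_frames) array from token ids (inverse of patchify)."""
    n_f, n_t = grid
    x = codebook[ids].reshape(n_f, n_t, patch_f, patch_t).transpose(0, 2, 1, 3)
    return x.reshape(n_f * patch_f, n_t * patch_t)

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((128, 64))        # stand-in for a real log-Mel spectrogram
codebook = rng.standard_normal((256, 16 * 4))   # hypothetical, untrained codebook

patches, grid = patchify(log_mel, 16, 4)
ids = quantize(patches, codebook)               # a single stream of token ids
recon = dequantize(ids, codebook, grid, 16, 4)  # coarse spectrogram approximation
```

In this sketch each token covers a block of both frequency bins and time frames, so one id per patch replaces the stack of per-frame ids that a multi-codebook residual quantizer would emit. Turning `recon` back into a waveform is the job of a separate vocoder stage, which is not shown here.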

Paper Structure

This paper contains 29 sections, 3 theorems, 24 equations, 6 figures, and 8 tables.

Key Result

Lemma 3.2

If $f$ is locally Lipschitz continuous with constant $L$, then

Figures (6)

  • Figure 1: Mel-spectrogram comparison of original and reconstructed waveforms produced by different codecs. MelTok accurately reconstructs the high-frequency details.
  • Figure 2: Loss of high-frequency detail (above 20 kHz) in a waveform-based codec. Left: spectrogram of the output of a waveform-based codec using 4 quantizers. Right: ground truth (GT). The high-frequency Mel spectra differ noticeably, so the high-frequency components, heard as the bright ringing in the original audio, are poorly reconstructed.
  • Figure 3: t-SNE visualization of latent embeddings produced by the 1D and 2D tokenizers. Each point represents the latent embedding of a single time frame, and each color denotes a different frequency band.
  • Figure 4: Comparison of reconstructed log-mel spectrograms from models trained with different losses. The bottom row shows a zoomed-in view, highlighting the differences in smoothness and spectral sharpness.
  • Figure 5: Training paradigm of MelTok.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Lemma 3.2: Lipschitz Bound
  • Theorem 3.3: Bounded Waveform Error
  • Lemma A.1: ISTFT Lipschitz Continuity
  • proof
  • Remark A.2
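
For orientation, a Lipschitz-type error bound of the kind these titles suggest usually rests on the standard definition and its composition property, sketched below. This is textbook material, not the paper's exact statement (the lemma text above is truncated and is left as-is); the maps $g$ and $h$, the constants $L_g$ and $L_h$, the set $U$, and the perturbation size $\varepsilon$ are illustrative symbols introduced here.

$$
\|f(x) - f(y)\| \le L \,\|x - y\| \quad \text{for all } x, y \in U .
$$

$$
\|g(h(x)) - g(h(y))\| \le L_g L_h \,\|x - y\|,
\qquad
\|x - \hat{x}\| \le \varepsilon \;\Longrightarrow\; \|g(h(x)) - g(h(\hat{x}))\| \le L_g L_h \,\varepsilon .
$$

The second line assumes $h$ is $L_h$-Lipschitz on $U$ and $g$ is $L_g$-Lipschitz on $h(U)$. Read this way, a bounded perturbation at the input of a chain of Lipschitz maps, for example a quantization error followed by a decoder and an inverse transform, yields a bounded error at the output, scaled by the product of the constants.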