Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Phillip Long, Zachary Novack, Chris Donahue

TL;DR

We propose Trilobyte, a byte-level tokenization schema for full-resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression.

Abstract

Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
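The vocabulary arithmetic in the abstract can be made concrete with a short sketch. The code below is illustrative only and is not the authors' implementation: the function names, the most-significant-byte-first ordering, and the offset-binary mapping of signed samples are all assumptions made for this example.

```python
import math


def sample_vocab_size(bit_depth: int) -> int:
    """Vocabulary size under sample-level tokenization: one token per sample value."""
    return 2 ** bit_depth


def bytes_per_sample(bit_depth: int) -> int:
    """Number of byte tokens per sample, i.e. ceil(b / 8)."""
    return math.ceil(bit_depth / 8)


def to_byte_tokens(samples, bit_depth):
    """Split signed integer samples into byte tokens in the range 0..255.

    Assumptions (not from the paper): samples are shifted to unsigned
    offset-binary and emitted most-significant byte first.
    """
    n = bytes_per_sample(bit_depth)
    offset = 1 << (bit_depth - 1)
    tokens = []
    for s in samples:
        u = s + offset
        if not 0 <= u < (1 << bit_depth):
            raise ValueError(f"sample {s} does not fit in {bit_depth} bits")
        tokens.extend(u.to_bytes(n, "big"))
    return tokens


def from_byte_tokens(tokens, bit_depth):
    """Invert to_byte_tokens, recovering the original samples exactly."""
    n = bytes_per_sample(bit_depth)
    offset = 1 << (bit_depth - 1)
    return [
        int.from_bytes(bytes(tokens[i:i + n]), "big") - offset
        for i in range(0, len(tokens), n)
    ]


samples_24bit = [-8388608, -1, 0, 1, 8388607]
tokens_24bit = to_byte_tokens(samples_24bit, 24)
```

Under these assumptions, a 24-bit sample becomes three tokens drawn from a fixed vocabulary of 256, whereas sample-level tokenization would need `sample_vocab_size(24)` distinct tokens. The round trip through `from_byte_tokens` is exact, which is the property lossless compression requires.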

Paper Structure (18 sections, 5 figures, 1 table)

This paper contains 18 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Tokenization strategies for language model compression. Standard sample-level tokenization (top) yields vocabulary size $|\mathcal{V}| = 2^{b}$. This exponential scaling inhibits modeling of industry-standard bit depths ($16$, $24$). Trilobyte's hierarchical byte-level tokenization (bottom) decomposes samples into bytes, yielding constant $|\mathcal{V}| = 256$ regardless of bit depth (at the cost of increasing sequence length by a factor of $\left\lceil {b/8} \right\rceil$). Both feed into an AR LM and arithmetic coder, but only Trilobyte makes 24-bit modeling tractable.
  • Figure 2: FLAC compression performance across diverse audio domains at 8-bit and 16-bit quantization levels. Birdvox achieves exceptional compression ($\sim$6x at 8-bit), perhaps reflecting the sparse and structurally constrained nature of bird vocalizations, which are highly predictable under linear predictive coding. Meanwhile, speech and music datasets show more modest gains. 16-bit audio generally achieves 1.5--2.5x compression, with diminishing returns beyond FLAC level 3. Note that we disable FLAC's verbatim, constant, and fixed subframe types, and that we do not evaluate Beethoven, YouTube Mix, or SC09 beyond 8-bit because they are 8-bit datasets.
  • Figure 3: Compression rate comparison across FLAC, DAC, EnCodec, and Custom DAC compressors on MusDB18 mixes. FLAC achieves the best compression, at approximately 1.8x, while the approaches based on neural audio codecs (NACs) underperform, with EnCodec actually increasing file size.
  • Figure 4: Residual distribution comparison showing residual magnitudes (note the log scale) for FLAC, DAC, EnCodec, and Custom DAC compressors. FLAC residuals follow a geometric distribution with a mean absolute residual of 156.34, while DAC, EnCodec, and Custom DAC residuals are more uniformly distributed regardless of codebook level, with mean absolute residuals of 1,603.54 (DAC), 18,376.66 (EnCodec), and 1,245.76 (Custom DAC), roughly one to two orders of magnitude larger than FLAC.
  • Figure 5: In-context LM-based compression performance, using the method defined in Delétang et al. [deletang2023language] and Li et al. [li2025lossless] with pre-trained language models (Llama-2-7B and Llama-2-13B [touvron2023llama2]), across diverse audio domains at 8-bit and 16-bit quantization. We also report FLAC compression results at compression level 8, the maximum. Model scaling (7B to 13B) shows minimal gains at 8-bit and some improvements at 16-bit, especially for complex datasets. This method underperforms FLAC on most signals, with the exception of SC09 and Epidemic Sound at 8-bit.
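The captions above report compression as a ratio such as 1.8x, produced by an AR LM feeding an arithmetic coder. As a rough sketch of how such a number can be estimated, an arithmetic coder spends close to $-\log_2 p$ bits on a token the model assigned probability $p$, so the total coded size is approximately the sum of those terms. The function names below are hypothetical, and defining the ratio as raw size over coded size is an assumption chosen to match the "x" values in the captions. It is not taken from the authors' code.

```python
import math


def ideal_code_length_bits(token_probs):
    """Approximate arithmetic-coded size in bits.

    token_probs holds the probability the model assigned to each token
    that actually occurred. Real coders add a small constant overhead,
    which this estimate ignores.
    """
    return sum(-math.log2(p) for p in token_probs)


def compression_ratio(num_samples, bit_depth, coded_bits):
    """Raw PCM size divided by coded size; values above 1 mean the file shrank.

    Assumption: this matches the 'x' convention used in the figure captions.
    """
    return (num_samples * bit_depth) / coded_bits


# Two 24-bit samples as six byte tokens under a uniform model over 256 bytes:
# every token costs 8 bits, so nothing is gained.
uniform_bits = ideal_code_length_bits([1 / 256] * 6)
uniform_ratio = compression_ratio(2, 24, uniform_bits)

# A hypothetical sharper model assigning probability 1/16 to each true token:
# every token costs 4 bits, halving the size.
sharper_bits = ideal_code_length_bits([1 / 16] * 6)
sharper_ratio = compression_ratio(2, 24, sharper_bits)
```

A model that is no better than uniform yields a ratio of 1, and a ratio below 1 (as reported for EnCodec in Figure 3) means the coded output is larger than the raw audio.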