S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization

Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

TL;DR

S-PRESSO is a diffusion autoencoder that operates in the latent space of a pretrained AudioAE to compress 48 kHz sound effects at ultra-low bitrates. It combines a continuous latent encoder and a diffusion decoder with offline neural quantization (Qinco2), then finetunes the diffusion decoder on the quantized embeddings to preserve perceptual quality under extreme frame-rate reduction. S-PRESSO reaches up to 750× compression and outperforms both continuous and discrete baselines in audio quality and acoustic similarity, as measured by objective metrics and human MUSHRA tests. The work shows that diffusion priors can shift ultra-low-bitrate audio compression toward acoustic similarity and realism rather than exact fidelity, with practical implications for interactive media and streaming.

Abstract

Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48 kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1 Hz (750× compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
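The relation between frame rate and bitrate quoted in the abstract can be made concrete with a small calculation. The sketch below uses the standard accounting for multi-codebook quantizers (frames per second × codebooks per frame × bits per code). The function names and the codebook configuration are illustrative assumptions, not values taken from the paper, and the abstract does not state that the 0.096 kbps and 1 Hz figures belong to the same setting.

```python
import math


def bitrate_kbps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a multi-codebook quantizer in kbps.

    Each frame stores one index per codebook, and each index costs
    log2(codebook_size) bits.
    """
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_hz * bits_per_frame / 1000.0


def bits_per_frame(target_kbps: float, frame_rate_hz: float) -> float:
    """Bit budget available to each frame at a given bitrate and frame rate."""
    return target_kbps * 1000.0 / frame_rate_hz


# Hypothetical configuration (an assumption for illustration only):
# 1 frame per second, 8 codebooks of 4096 entries (12 bits each).
example_kbps = bitrate_kbps(frame_rate_hz=1.0, num_codebooks=8, codebook_size=4096)

# If the lowest reported bitrate were paired with a 1 Hz frame rate,
# every second of audio would have to fit into this many bits.
budget_bits = bits_per_frame(target_kbps=0.096, frame_rate_hz=1.0)

print(f"{example_kbps:.3f} kbps, {budget_bits:.0f} bits per frame")
```

Under these assumed numbers, a full second of 48 kHz audio would be described by roughly 96 bits, which illustrates why the decoder has to rely on generative priors rather than on the transmitted code alone.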

Paper Structure

This paper contains 17 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of our method. Step 1: An audio clip is encoded into latent vectors $x_0$ by a low-compression audio autoencoder. These vectors are then compressed into latents $z$, which are upsampled by $f_\phi$ to condition the decoder $D_\theta$, a DiT pretrained to reconstruct $x_0$ from noised inputs. $D_\theta$ is finetuned using LoRA adapters, jointly trained with the latent encoder $g_{\psi}$ and $f_\phi$. Step 2: The features $z$ are then quantized offline into $z_q$. Step 3: The diffusion decoder $D_\theta$ is finetuned on $z_q$ to compensate for quantization-induced degradation.
  • Figure 2: (a) Overview of the latent encoder architecture. (b) Conditioning mechanism within the diffusion decoder.
  • Figure 3: MUSHRA scores for S-PRESSO, SemantiCodec and a 3.5 kHz low-pass anchor, evaluated at $\sim$1.35 kbps and $\sim$0.3 kbps.
  • Figure 4: Evaluation of S-PRESSO at varying bitrates and frame rates.
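
Step 2 of Figure 1 quantizes the already-trained continuous features $z$ into $z_q$ without touching the encoder. As a rough illustration of what such an offline stage does, the sketch below applies plain residual vector quantization with fixed toy codebooks. This is a deliberately simplified stand-in and not Qinco2 itself; the codebooks, the function names and the example vector are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(z: Vector, codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize z stage by stage; each stage encodes the previous residual.

    Returns the list of selected indices (the discrete code) and the
    reconstruction z_q, which is the sum of the selected entries.
    """
    residual = list(z)
    z_q = [0.0] * len(z)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        z_q = [q + c for q, c in zip(z_q, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, z_q


# Toy two-stage codebooks (hypothetical; real codebooks would be fitted
# offline on embeddings produced by the frozen latent encoder).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],        # coarse stage
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],    # refinement stage
]

codes, z_q = rvq_encode([1.2, 0.8], toy_codebooks)
print(codes, z_q)
```

Only the integer codes need to be stored or transmitted. The gap that remains between $z$ and $z_q$ is the quantization-induced degradation that Step 3 addresses by finetuning the decoder on $z_q$.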