A Generative-First Neural Audio Autoencoder

Jonah Casebeer, Ge Zhu, Zhepei Wang, Nicholas J. Bryan

TL;DR

This work introduces GenAE, a generative-first neural audio autoencoder that unifies continuous and discrete latents and multiple audio channel formats within one architecture. By combining architectural innovations (efficient activations, early downsampling, mel-spectrogram fusion, and windowed attention), training strategies (multi-format augmentation and coprime multi-resolution losses), and a post-training discretization path (RVQ), GenAE achieves roughly 10x faster encoding and 1.6x lower latent rates while maintaining competitive reconstruction quality. A single model supports both diffusion-like (continuous) and language-model-like (discrete) generative workflows across formats, enabling long-context audio generation at substantially reduced computational cost. The results show that GenAE attains a favorable rate–distortion frontier, enables up to 6.5× longer context than prior baselines, and compresses a typical 60-second track to under 800 tokens, lowering the barrier to large-scale generative audio modeling.

Abstract

Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
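
The rate figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the abstract: a 44.1 kHz sample rate, one token per latent frame for mono audio, and a partially filled final frame rounded up. The helper names `latent_rate_hz` and `num_latent_frames` are hypothetical.

```python
import math

# Assumption: 44.1 kHz input audio (the abstract does not state the sample rate).
SAMPLE_RATE_HZ = 44_100


def latent_rate_hz(sample_rate_hz: int, downsample_factor: int) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate_hz / downsample_factor


def num_latent_frames(duration_s: float, sample_rate_hz: int, downsample_factor: int) -> int:
    """Latent frames for a clip, assuming the last partial frame is padded (ceil)."""
    num_samples = round(duration_s * sample_rate_hz)
    return math.ceil(num_samples / downsample_factor)


if __name__ == "__main__":
    print(latent_rate_hz(SAMPLE_RATE_HZ, 3360))         # 13.125
    print(num_latent_frames(60, SAMPLE_RATE_HZ, 3360))  # 788
    print(num_latent_frames(60, SAMPLE_RATE_HZ, 2048))  # 1292
    print(3360 / 2048)                                  # 1.640625
```

Under these assumptions, a 3360x downsampling factor yields about 13 latent frames per second and 788 frames for a 60-second mono clip, and the ratio 3360 / 2048 ≈ 1.64 is consistent with the stated 1.6x lower rate.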

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Encode log RTF (solid) and decode log RTF (striped) for GenAE ablations and baselines. GenAE modifications are split into "speed" and "quality" categories. Baselines are shown on the right. Faster encoding accelerates generative workflows.
  • Figure 2: GenAE Model Architecture: DWPW is a standard depth-wise/point-wise layer with a dilation (1/3/9). Attn Stack is a standard multi-head windowed attention block. SLite is the SnakeLite activation. TCN is a standard dilated residual convolution block. Mel represents inputting or outputting a mel-spectrogram. Format represents an audio channel format token.
  • Figure 3: Stereo rate-distortion vs. latent rate (Hz). GenAE at 13 Hz matches the baselines while operating at a far lower latent rate, and at 36 Hz it surpasses all baselines. Lower rates reduce tokens and memory for long-context generation. GenAE is Pareto-optimal on the compression/reconstruction frontier in both continuous (-KL) and discrete (-VQ) latent modes. PESQ-WB and Audiobox Aesthetics follow a similar trend. Models listed in the legend but not plotted fall outside the plot ranges. Models marked with a * run only at 24 kHz. The Pareto frontier is shown in grey.
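
The Pareto frontier mentioned in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' plotting code: the function `pareto_frontier` is a hypothetical helper, the distortion score is a generic lower-is-better value, and the sample points are made-up placeholders rather than results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latent_rate_hz, distortion), lower is better on both axes


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both rate and distortion."""
    frontier: List[Point] = []
    best_distortion = float("inf")
    # Scan in order of increasing rate; keep a point only if it strictly
    # improves on the best distortion seen at any lower (or equal) rate.
    for rate, distortion in sorted(points):
        if distortion < best_distortion:
            frontier.append((rate, distortion))
            best_distortion = distortion
    return frontier


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT measurements from the paper.
    toy_points = [(10.0, 5.0), (20.0, 6.0), (30.0, 3.0), (40.0, 4.0), (50.0, 3.0)]
    print(pareto_frontier(toy_points))  # [(10.0, 5.0), (30.0, 3.0)]
```

A model lies on the frontier when no other model achieves lower distortion at an equal or lower latent rate. In the toy data, the point at rate 50 is dropped because the point at rate 30 reaches the same distortion with fewer latent frames.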