Gull: A Generative Multifunctional Audio Codec

Yi Luo, Jianwei Yu, Hangting Chen, Rongzhi Gu, Chao Weng

TL;DR

Gull is a universal-sample-rate neural audio codec that operates in the frequency domain and combines band-split modeling, gain-shape representations, and hierarchical SRVQ. Its elastic decoder can adapt its width and depth at inference time. Training is adversarial, with multi-resolution STFT discriminators used to balance distortion against perceptual quality, and the codec optionally performs bandwidth extension without increasing the bitrate. Across speech and music, Gull matches or surpasses traditional codecs and a strong neural baseline at multiple sample rates, bitrates, and model complexities, while maintaining low latency. The work targets real-time communication, audio generation pipelines, and codec-language-model integration, and leaves room for further optimization and end-to-end extensions.

Abstract

We introduce Gull, a generative multifunctional audio codec. Gull is a general-purpose neural audio compression and decompression model that can be applied to a wide range of tasks and applications such as real-time communication, audio super-resolution, and codec language models. The key components of Gull include (1) universal-sample-rate modeling via subband modeling schemes motivated by recent progress in audio source separation, (2) gain-shape representations motivated by traditional audio codecs, (3) improved residual vector quantization modules, (4) an elastic decoder network that enables user-defined model size and complexity at inference time, and (5) a built-in ability for audio super-resolution without an increase in bitrate. We compare Gull with existing traditional and neural audio codecs and show that Gull achieves on-par or better performance across various sample rates, bitrates, and model complexities in both subjective and objective evaluation metrics.
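To make components (2) and (3) concrete, the following is a minimal sketch of a gain-shape split followed by plain residual vector quantization. It is not the paper's implementation: the improved quantizer used in Gull is not reproduced here, the codebooks are random rather than learned, and every name (`gain_shape_split`, `make_codebooks`, `rvq_encode`, `rvq_decode`) is hypothetical. Each codebook includes an all-zero codeword, an assumption made only so that the reconstruction error can never grow from one stage to the next.

```python
import math
import random

def gain_shape_split(vec, eps=1e-12):
    """Split a vector into a scalar gain (its L2 norm) and a unit-norm shape."""
    gain = math.sqrt(sum(x * x for x in vec))
    shape = [x / max(gain, eps) for x in vec]
    return gain, shape

def make_codebooks(num_stages, size, dim, rng):
    """Hypothetical random codebooks; later stages use smaller codewords.

    Each codebook starts with an all-zero codeword (an assumption of this
    sketch) so a stage can always choose to leave the residual unchanged.
    """
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        book = [[0.0] * dim]
        book += [[rng.gauss(0.0, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        codebooks.append(book)
    return codebooks

def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2
                                 for c, x in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Plain residual VQ: each stage quantizes what earlier stages left over."""
    residual = list(vec)
    indices = []
    for book in codebooks:
        i = nearest(book, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, book[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords; fewer indices give a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for i, book in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, book[i])]
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    frame = [rng.gauss(0.0, 3.0) for _ in range(8)]  # stand-in for one band's features
    gain, shape = gain_shape_split(frame)
    codebooks = make_codebooks(num_stages=4, size=64, dim=8, rng=rng)
    indices = rvq_encode(shape, codebooks)
    for k in range(1, len(indices) + 1):
        approx = rvq_decode(indices[:k], codebooks)
        err = math.sqrt(sum((a - s) ** 2 for a, s in zip(approx, shape)))
        print(f"stages={k} shape error={err:.4f}")
```

Decoding with only the first few indices yields a coarser reconstruction, which is the general mechanism by which residual quantizers support several bitrates with one model. In this sketch the gain would be transmitted separately as a scalar.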

Paper Structure

This paper contains 17 sections, 12 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Flowchart of the model architecture of Gull codec.
  • Figure 2: Flowchart of the STFT discriminator.
  • Figure 3: Subjective evaluation results for Gull and other benchmark conventional and neural audio codecs.