FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

Nicola Pia, Martin Strauss, Markus Multrus, Bernd Edler

TL;DR

FlowMAC addresses high-quality general audio coding at low bit rates by using conditional flow matching (CFM) to train a mel spectrogram decoder, built on a continuous normalizing flow (CNF), that is conditioned on a compact latent code. The architecture pairs a learned mel encoder, quantizer, and decoder with the CNF mel decoder, which is integrated with an ODE solver at inference time. Training is scalable and memory-efficient, and inference is CPU-friendly. In subjective listening tests, FlowMAC at 3 kbps reaches quality comparable to state-of-the-art GAN-based and DDPM-based codecs at twice the bit rate. Complexity is configurable through the number of function evaluations (NFE) and the classifier-free guidance (CFG) setting. With a streamlined mel-to-audio synthesis backend the codec runs in real time, though it shows some limitations on out-of-distribution signals.

Abstract

This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer, and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows complexity to be traded off against quality. This enables real-time coding on CPU while maintaining high perceptual quality.
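
The inference step described in the abstract, integrating a learned vector field with an ODE solver, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the fixed-step Euler solver, the names `euler_cfg_sample` and `toy_vector_field`, the guidance rule (one common formulation of classifier-free guidance), and the toy field that stands in for a trained network. The number of solver steps plays the role of the complexity knob: fewer steps mean fewer network evaluations.

```python
import random


def toy_vector_field(x, t, cond):
    """Hypothetical stand-in for a trained CFM network.

    Returns the velocity of a straight line from x to a target, reaching it
    at t = 1. cond=None plays the role of the unconditional branch and
    pulls toward an all-zero target.
    """
    target = cond if cond is not None else [0.0] * len(x)
    return [(c - xi) / (1.0 - t) for c, xi in zip(target, x)]


def euler_cfg_sample(vector_field, cond, dim, num_steps=8, cfg_scale=1.0, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps.

    Assumed guidance rule: v = v_uncond + cfg_scale * (v_cond - v_uncond).
    Each step evaluates the field twice (conditional and unconditional).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # always < 1, so the toy field never divides by zero
        v_cond = vector_field(x, t, cond)
        v_uncond = vector_field(x, t, None)
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Usage: the conditioning vector stands in for the decoded latent code.
cond = [0.5, -1.0, 2.0]
sample = euler_cfg_sample(toy_vector_field, cond, dim=3, num_steps=4)
```

In this toy setting the trajectories are straight lines, so even a few Euler steps land on the target. A trained network generally has curved trajectories, which is why the step count trades quality against compute.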

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: FlowMAC architecture. The top shows the high-level pipeline. The bottom left shows the structure of the mel spectrogram encoder and decoder. The bottom right shows the details of the CFM module.
  • Figure 2: Results for the P.808 DCR test with 46 listeners and 95% CI.
  • Figure 3: Results for the MUSHRA test with 14 listeners and 95% CI.