FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates
Nicola Pia, Martin Strauss, Markus Multrus, Bernd Edler
TL;DR
FlowMAC addresses high-quality general audio coding at low bit rates by using conditional flow matching (CFM) to train a mel spectrogram decoder, based on a continuous normalizing flow (CNF), that is conditioned on a compact latent code. The architecture combines a learned mel encoder, quantizer, and decoder with a CNF mel decoder that is integrated via an ODE solver, which keeps training scalable and memory-efficient and inference CPU-friendly. In subjective evaluations, FlowMAC at 3 kbps reaches quality comparable to state-of-the-art GAN-based and DDPM-based codecs operating at twice the bit rate, and its complexity is configurable through the number of function evaluations (NFE) and the classifier-free guidance (CFG) setting. The approach runs in real time with a streamlined mel-to-audio synthesis backend, though it shows some limitations on out-of-distribution signals.
Abstract
This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer, and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time that a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves quality similar to that of state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline that allows complexity to be traded off against quality. This enables real-time coding on CPU while maintaining high perceptual quality.
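To make the complexity/quality trade-off of ODE-based decoding concrete, the following is a minimal sketch of fixed-step Euler integration from t = 0 to t = 1. It is not FlowMAC's implementation: `toy_vector_field` is a hypothetical stand-in for the trained CNF network, the choice of the Euler method is an assumption, and the 8-element lists are placeholders for a mel spectrogram and its conditioning. The toy field is chosen only because its exact solution is known, so the integration error can be measured for different step counts.

```python
import math
import random


def toy_vector_field(x, t, cond):
    """Hypothetical stand-in for the trained network v(x, t | cond).

    It simply pulls x toward cond; t is unused here but kept in the
    signature because a real vector field is time-dependent.
    """
    return [c - xi for xi, c in zip(x, cond)]


def euler_decode(x0, cond, num_steps):
    """Integrate dx/dt = v(x, t | cond) from t=0 to t=1 with Euler steps.

    num_steps equals the number of vector-field evaluations in this sketch.
    """
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = toy_vector_field(x, t, cond)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
cond = [random.gauss(0.0, 1.0) for _ in range(8)]  # placeholder conditioning
x0 = [random.gauss(0.0, 1.0) for _ in range(8)]    # Gaussian noise at t = 0

# Closed-form solution of the toy ODE at t = 1, used only to measure error.
exact = [c + (a - c) * math.exp(-1.0) for a, c in zip(x0, cond)]


def max_error(num_steps):
    out = euler_decode(x0, cond, num_steps)
    return max(abs(o - e) for o, e in zip(out, exact))


err_coarse = max_error(2)   # few evaluations: cheap, less accurate
err_fine = max_error(32)    # more evaluations: costlier, more accurate
print(err_coarse, err_fine)
```

In this toy setting, increasing `num_steps` lowers the integration error at a proportional increase in compute, which is the kind of knob a tunable inference pipeline exposes.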
