FlowDec: A flow-based full-band general audio codec with high perceptual quality

Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu

TL;DR

FlowDec targets high-quality general audio coding at very low bitrates by decoupling reconstruction from adversarial training and adding a stochastic postfilter based on conditional flow matching. The method starts from a DAC backbone trained without adversarial losses and learns a flow-based postfilter that conditions on the initial decoded output to produce perceptually pleasing reconstructions. This supports bitrates below 8 kbit/s at 48 kHz, and as low as 4 kbit/s, while reducing the postfilter DNN evaluations from the 60 needed by the score-based ScoreDec to 6. Key contributions include a joint flow-matching formulation tailored for enhancement, a frequency-aware noise level scheme, and a multiscale CQT plus L1 waveform loss to improve low-frequency fidelity, all validated on diverse speech, music, and sound data with objective and subjective evaluations. FlowDec is competitive with GAN-based codecs such as DAC, with better FAD scores and on-par listening test scores, and it extends the postfilter approach from speech to general audio. Streaming and joint training are named as avenues for future work.

Abstract

We propose FlowDec, a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.
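
The count of six postfilter DNN evaluations can be pictured as a short fixed-step ODE solve. The following is a minimal sketch under stated assumptions, not the authors' implementation: the Euler solver, the Gaussian start centered on the decoder output, the value of `sigma_y`, and the names `euler_postfilter`, `velocity_fn`, and `toy_velocity` are all hypothetical and chosen for illustration. The toy velocity field stands in for the trained network and uses a known target, which a real postfilter would not have.

```python
import numpy as np


def euler_postfilter(y, velocity_fn, sigma_y=0.1, num_steps=6, rng=None):
    """Integrate dx/dt = velocity_fn(x, t, y) from t=0 to t=1 with fixed-step Euler.

    y           : initial decoder output (array)
    velocity_fn : hypothetical stand-in for the trained postfilter DNN
    sigma_y     : assumed standard deviation of the noise added around y at t=0
    num_steps   : number of Euler steps, equal to the number of DNN evaluations
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: the flow starts from a noisy copy of the decoder output.
    x = y + sigma_y * rng.standard_normal(y.shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        # One network evaluation per step, conditioned on the decoder output y.
        x = x + dt * velocity_fn(x, t, y)
    return x


# Toy setup: a known clean signal and a degraded "decoder output".
rng = np.random.default_rng(0)
x_star = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))
y = x_star + 0.3 * rng.standard_normal(16)
calls = []  # records the time of every velocity evaluation


def toy_velocity(x, t, y_cond):
    # Velocity of a straight-line path that ends at x_star when t reaches 1.
    # A trained model would predict this from (x, t, y_cond) without x_star.
    calls.append(t)
    return (x_star - x) / (1.0 - t)


enhanced = euler_postfilter(y, toy_velocity, sigma_y=0.1, num_steps=6, rng=rng)
```

In this toy case the velocity field is evaluated exactly six times and the last Euler step lands on `x_star` up to floating-point error, because the straight-line velocity points directly at the target. With a learned velocity field the result would instead be one stochastic sample of a plausible clean signal, and rerunning with a different noise draw would give a different enhanced estimate.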

Paper Structure

This paper contains 36 sections, 14 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Method overview: Codecs such as DAC [kumar2023dac] employ adversarial training, using multiple specialized discriminator networks trained jointly with the decoder. Our method FlowDec is trained in a non-adversarial two-stage fashion, removing these discriminators and instead adding a stochastic postfilter that can produce multiple enhanced estimates from the output of the pretrained decoder.
  • Figure 2: Unconditional $q_0(x_0)$ versus our $q_0(x_0|x_1)$. Colored dots represent $y$; stars are the associated $x^*$.
  • Figure 3: Flow field comparison at $t=0.7$ for our linear $\sigma_t$ (left) versus score-based SGMSE (center) and FlowAVSE with constant $\sigma_t$ (right) for a toy problem. The white dot is $y$, yellow stars are possible $x^*$, blue lines are sample trajectories, and the background color indicates the density $p_t$. SGMSE has highly curved trajectories and does not contract to $x^*$; FlowAVSE is non-contractive.
  • Figure 4: Mean objective metrics attained by compared methods on the test set at varying bitrates. Colored bands indicate 95% confidence intervals. SIGMOS is speech-only and is calculated only on the speech test files. FAD is multiplied by 100 for readability. Numbers can be found in the full objective metrics table (tab:full-objective-metrics).
  • Figure 5: Perception (FAD) vs. distortion (SISDR) vs. rate tradeoff [blau2019rethinking] of compared methods. Numbers next to points indicate the bitrate in kbit/s.
  • ...and 11 more figures