High-Fidelity Music Vocoder using Neural Audio Codecs
Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, Roger Wattenhofer
TL;DR
This work tackles high-fidelity polyphonic music synthesis from mel spectrograms by introducing DisCoder, a GAN-based encoder–decoder that maps mel features into the Descript Audio Codec (DAC) latent space and reconstructs audio with a fine-tuned DAC decoder. Training proceeds in two stages: the encoder is first aligned with the DAC latent space; the alignment constraint is then removed and a skip connection is added to preserve the original mel information. DisCoder achieves state-of-the-art results for music on several objective metrics and in a MUSHRA listening study, while remaining competitive for speech, highlighting its potential as a universal vocoder. The approach combines neural vocoder techniques with neural audio codecs to produce high-quality 44.1 kHz audio, and code and checkpoints are publicly available for reproducibility and broader applications.
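The two-stage schedule can be illustrated with a minimal sketch. Everything named here is hypothetical: `task_loss` stands in for whatever adversarial and reconstruction terms are actually used, `latent_mismatch` for a distance between the encoder output and the DAC latent, and modelling the constraint as an additive weighted penalty is an assumption, not a detail taken from the paper.

```python
def training_objective(stage, task_loss, latent_mismatch, align_weight=1.0):
    """Toy objective for the two training stages (illustrative only).

    stage 1: task loss plus a penalty that pulls the encoder output
             toward the DAC latent space (assumed additive form).
    stage 2: the alignment penalty is dropped.
    """
    if stage == 1:
        return task_loss + align_weight * latent_mismatch
    if stage == 2:
        return task_loss
    raise ValueError("stage must be 1 or 2")


def uses_skip_connection(stage):
    """The skip connection carrying mel information is added in stage 2."""
    return stage == 2


stage1 = training_objective(1, task_loss=2.0, latent_mismatch=0.5)
stage2 = training_objective(2, task_loss=2.0, latent_mismatch=0.5)
```

With the same inputs, the stage 2 objective is smaller than the stage 1 objective by exactly the weighted alignment term, which is the only thing the sketch is meant to show.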
Abstract
While neural vocoders have made significant progress in high-fidelity speech synthesis, their application to polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space and then reconstructs the audio signal from this representation using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
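The data flow described in the abstract (mel spectrogram, then latent, then waveform) can be traced at the level of array shapes with the toy NumPy sketch below. It is not the DisCoder model: the random matrices are stand-ins for the learned encoder and the fine-tuned DAC decoder, and `N_MELS`, `LATENT_DIM`, and `HOP` are assumed values chosen only for illustration. Only the 44.1 kHz sample rate comes from the text.

```python
import numpy as np

SAMPLE_RATE = 44_100  # from the abstract
N_MELS = 128          # assumed number of mel bands
LATENT_DIM = 64       # assumed latent size, below N_MELS to mirror "lower-dimensional"
HOP = 512             # assumed number of audio samples produced per frame

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((LATENT_DIM, N_MELS)) * 0.1  # stand-in for the encoder
W_dec = rng.standard_normal((HOP, LATENT_DIM)) * 0.1     # stand-in for the DAC decoder


def encode(mel):
    """Map a mel spectrogram (N_MELS, frames) to a latent (LATENT_DIM, frames)."""
    return np.tanh(W_enc @ mel)


def decode(latent):
    """Map a latent (LATENT_DIM, frames) to a waveform of frames * HOP samples."""
    per_frame = W_dec @ latent       # shape (HOP, frames)
    return per_frame.T.reshape(-1)   # concatenate the frames in time order


mel = rng.standard_normal((N_MELS, 86))  # 86 frames, roughly one second at HOP = 512
latent = encode(mel)
audio = decode(latent)
```

In the real system both mappings are deep networks trained adversarially; the sketch only shows that the encoder changes the feature dimension per frame while the decoder expands each frame into audio samples.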
