High-Fidelity Music Vocoder using Neural Audio Codecs

Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, Roger Wattenhofer

TL;DR

This work tackles high-fidelity polyphonic music synthesis from mel spectrograms by introducing DisCoder, a GAN-based encoder–decoder that maps mel features into the Descript Audio Codec (DAC) latent space and reconstructs audio with a fine-tuned DAC decoder. Training proceeds in two stages: the encoder is first aligned with the DAC latent space; the alignment constraint is then removed and a skip connection is added to preserve information from the input mel spectrogram. DisCoder achieves state-of-the-art results for music on several objective metrics and in a MUSHRA listening study, while remaining competitive for speech, which highlights its potential as a universal vocoder. The approach combines neural vocoder techniques with neural audio codecs to produce high-quality 44.1 kHz audio, and code and checkpoints are publicly available.

Abstract

While neural vocoders have made significant progress in high-fidelity speech synthesis, their application to polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space and then decodes it into an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
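
The two-stage training described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the L1 form of the alignment term, the `align_weight` parameter, and the additive form of the skip connection are all assumptions made for illustration. The source only states that the latent space is aligned with DAC in stage one, and that the constraint is dropped and a skip connection added in stage two.

```python
from typing import List

Vector = List[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_loss(stage: int,
                   adversarial_loss: float,
                   reconstruction_loss: float,
                   z_pred: Vector,
                   z_dac: Vector,
                   align_weight: float = 1.0) -> float:
    """Hypothetical generator objective for the two training stages.

    Stage 1 adds a term pulling the encoder output z_pred towards the
    latent z_dac that the DAC encoder produces for the same audio.
    Stage 2 drops that term. The L1 form and the weight are assumptions.
    """
    total = adversarial_loss + reconstruction_loss
    if stage == 1:
        total += align_weight * l1_distance(z_pred, z_dac)
    return total


def decoder_input(z_pred: Vector, mel_skip: Vector, stage: int) -> Vector:
    """Latent passed to the decoder.

    In stage 2 a skip path carrying (projected) mel features is combined
    with the latent. Adding the two is an assumption of this sketch.
    """
    if stage == 1:
        return list(z_pred)
    if len(z_pred) != len(mel_skip):
        raise ValueError("skip features must match the latent size")
    return [z + s for z, s in zip(z_pred, mel_skip)]


if __name__ == "__main__":
    z_pred, z_dac = [1.0, 2.0], [1.5, 1.0]
    print(generator_loss(1, 0.5, 0.25, z_pred, z_dac))  # includes alignment term
    print(generator_loss(2, 0.5, 0.25, z_pred, z_dac))  # alignment term removed
    print(decoder_input(z_pred, [0.5, 0.5], stage=2))
```

In a real system the vectors would be tensors of latent frames and the scalar losses would come from GAN discriminators and spectral reconstruction terms; the sketch only shows how the objective and the decoder input change between stages.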

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Proposed DisCoder architecture. The mel spectrogram is encoded into a low-dimensional latent space before being decoded to a 44.1 kHz waveform. During the first stage of training, the latent space is aligned with the DAC prior. During the second stage of training, this constraint is removed and a skip connection is introduced to preserve information encoded in the initial mel spectrogram.
  • Figure 2: Comparison of mel spectrogram reconstruction quality between HiFi-GAN, BigVGAN-v2, and DisCoder against the ground truth. The three model columns show the absolute error between the mel spectrogram of the synthesized audio and that of the ground-truth audio. The rows represent two unseen music clips from the MTG-Jamendo dataset. DisCoder shows significantly less pronounced errors than HiFi-GAN and BigVGAN-v2.
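
The absolute-error maps described for Figure 2 can be sketched as follows. This is a minimal illustration, assuming both mel spectrograms are already computed and stored as nested lists of shape (mel bins × frames); the function names are hypothetical and the paper's actual plotting code is not shown in this excerpt.

```python
from typing import List

Spectrogram = List[List[float]]  # rows: mel bins, columns: time frames


def abs_error_map(mel_synth: Spectrogram, mel_ref: Spectrogram) -> Spectrogram:
    """Element-wise |mel_synth - mel_ref| for two same-shaped spectrograms."""
    if len(mel_synth) != len(mel_ref):
        raise ValueError("spectrograms must have the same number of mel bins")
    error = []
    for row_s, row_r in zip(mel_synth, mel_ref):
        if len(row_s) != len(row_r):
            raise ValueError("spectrograms must have the same number of frames")
        error.append([abs(s - r) for s, r in zip(row_s, row_r)])
    return error


def mean_abs_error(error_map: Spectrogram) -> float:
    """Average of all entries of an error map, as a single summary number."""
    values = [v for row in error_map for v in row]
    return sum(values) / len(values)


if __name__ == "__main__":
    reference = [[0.0, 1.0], [2.0, 3.0]]    # toy ground-truth mel values
    synthesized = [[0.5, 1.0], [1.0, 3.0]]  # toy vocoder output mel values
    err = abs_error_map(synthesized, reference)
    print(err)
    print(mean_abs_error(err))
```

Plotting one such map per model for the same clip gives a side-by-side view of where each vocoder deviates from the reference.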