Mel-Spectrogram Inversion via Alternating Direction Method of Multipliers

Yoshiki Masuyama, Natsuki Ueno, Nobutaka Ono

TL;DR

This work tackles the challenge of reconstructing a time-domain signal from a mel-spectrogram by jointly estimating the full-band STFT magnitude and phase. It introduces an ADMM-based mel-spectrogram inversion method that reformulates the problem with auxiliary variables and an augmented Lagrangian, enabling efficient, decomposed proximal updates that exploit STFT redundancy. Empirical results on speech and foley sounds show that the proposed approach outperforms cascaded pipelines and previous iPALM-based joint methods, achieving faster convergence and higher perceptual and intelligibility metrics. The method is training-free and versatile across signal types, with future work suggested to integrate neural priors for further improvements.

Abstract

Reconstructing a signal from its mel-spectrogram is known as mel-spectrogram inversion and has many applications, including speech and foley sound synthesis. In this paper, we propose a mel-spectrogram inversion method based on a rigorous optimization algorithm. To reconstruct a time-domain signal with the inverse short-time Fourier transform (STFT), both the full-band STFT magnitude and the phase must be predicted from a given mel-spectrogram. Estimating them jointly has outperformed the cascade of full-band magnitude prediction followed by phase reconstruction because it prevents error accumulation. However, the existing joint estimation method requires many iterations, and there remains room for performance improvement. We present a joint estimation method based on the alternating direction method of multipliers (ADMM), motivated by its success in various nonconvex optimization problems, including phase reconstruction. An efficient update of each variable is derived by exploiting the conditional independence among the variables. Our experiments demonstrate the effectiveness of the proposed method on speech and foley sounds.
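To make the inversion problem concrete, the following is a minimal sketch of the forward model the abstract refers to: a mel-spectrogram is obtained by discarding the STFT phase and compressing the frequency axis with a filterbank. It is not the authors' algorithm. The function `triangular_filterbank`, the linearly spaced triangles (a simplified stand-in for a true mel filterbank), the random stand-in STFT, and all sizes are assumptions made for illustration. The naive pseudo-inverse estimate shown here is only an example of a magnitude-prediction step; it shows that the full-band magnitude is not recoverable exactly, before phase is even considered.

```python
import numpy as np


def triangular_filterbank(n_bands: int, n_freqs: int) -> np.ndarray:
    """Hypothetical stand-in for a mel filterbank.

    Uses overlapping triangles on a linear frequency axis; a real mel
    filterbank spaces the triangles on the mel scale instead.
    """
    edges = np.linspace(0, n_freqs - 1, n_bands + 2)
    bins = np.arange(n_freqs)
    fb = np.zeros((n_bands, n_freqs))
    for b in range(n_bands):
        left, center, right = edges[b], edges[b + 1], edges[b + 2]
        rising = (bins - left) / (center - left)
        falling = (right - bins) / (right - center)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


rng = np.random.default_rng(0)
n_freqs, n_frames, n_bands = 257, 20, 40  # illustrative sizes only

# Random complex matrix standing in for an STFT computed from audio.
stft = rng.standard_normal((n_freqs, n_frames)) + 1j * rng.standard_normal(
    (n_freqs, n_frames)
)
magnitude = np.abs(stft)

fb = triangular_filterbank(n_bands, n_freqs)

# Forward model: phase is discarded and 257 bins are compressed to 40 bands.
mel = fb @ magnitude

# Naive full-band magnitude estimate (assumption, not the paper's method):
# apply the pseudo-inverse of the filterbank and clip negatives to zero.
magnitude_hat = np.maximum(np.linalg.pinv(fb) @ mel, 0.0)

# Relative error of the recovered magnitude; it is clearly nonzero because
# the filterbank has far fewer rows than columns.
rel_err = np.linalg.norm(magnitude_hat - magnitude) / np.linalg.norm(magnitude)
print(f"relative magnitude error: {rel_err:.3f}")
```

Any error made at this stage is passed on to a subsequent phase reconstruction step, which is the error accumulation that joint estimation of magnitude and phase is meant to avoid.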

Paper Structure

This paper contains 13 sections, 22 equations, 4 figures, 2 algorithms.

Figures (4)

  • Figure 1: SCM, PESQ, and ESTOI with different hyperparameters. In the top left panel, SCM with respect to $\lambda$ is depicted while $\rho$ is fixed at $0.1$. In the other panels, $\rho$ is changed while $\lambda$ is fixed at $5000$.
  • Figure 2: Average SCM with respect to the number of iterations and the boxplot of SCM with 500 iterations.
  • Figure 3: Boxplots of PESQ and ESTOI with 500 iterations.
  • Figure 4: SC of the iPALM and ADMM-based methods on foley sounds. Lower SC means better reconstruction.