Learning to Solve Inverse Problems for Perceptual Sound Matching

Han Han; Vincent Lostanlen; Mathieu Lagrange

TL;DR

The paper tackles inverse problems in perceptual sound matching (PSM) by proposing the perceptual-neural-physical (PNP) loss, a quadratic, precomputable surrogate of a perceptual loss. PNP uses the Jacobian of $(\boldsymbol{\Phi} \circ \boldsymbol{g})$ to define a Riemannian metric $\mathbf{M}(\boldsymbol{\theta})$ on the parameter space. An adaptive Levenberg–Marquardt-style damping term $\lambda$ stabilizes optimization when $\mathbf{M}$ is ill-conditioned and enables a smooth transition from parameter-space loss to perceptual loss during training. The approach is instantiated with joint time-frequency scattering (JTFS) as the perceptual feature and evaluated on two differentiable nonstationary synthesizers: an AM/FM arpeggiator and an FTM-based rectangular membrane drum model. Results show that JTFS-based PNP achieves state-of-the-art perceptual matching and accelerates training relative to DDSP with multiscale spectral (MSS) loss. The work includes ablations, a study of learning dynamics, reparameterization effects, and optimizer choices, and provides open-source code. Overall, PNP offers a scalable, geometry-aware way to train with rich perceptual objectives in PSM; future work points to domain adaptation for real-world data.
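The damped quadratic form summarized above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: `toy_synth` stands in for the synthesizer $\boldsymbol{g}$, `toy_features` (a magnitude spectrum) stands in for $\boldsymbol{\Phi}$ instead of JTFS, the metric is taken to be $\mathbf{M} = \mathbf{J}^\top \mathbf{J}$, the Jacobian is obtained by finite differences rather than automatic differentiation, and `lam` is a fixed scalar rather than an adaptive schedule.

```python
import numpy as np

def toy_synth(theta, n=256):
    """Hypothetical stand-in for g: theta = (amplitude, frequency in cycles/sample)."""
    amp, freq = theta
    t = np.arange(n)
    return amp * np.sin(2 * np.pi * freq * t)

def toy_features(x):
    """Hypothetical stand-in for Phi: magnitude spectrum (the paper uses JTFS)."""
    return np.abs(np.fft.rfft(x))

def jacobian_fd(fn, theta, eps=1e-5):
    """Central finite-difference Jacobian of fn at theta, shape (n_features, n_params)."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        cols.append((fn(theta + step) - fn(theta - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def metric(theta):
    """Assumed form M(theta) = J^T J, with J the Jacobian of (Phi o g) at theta."""
    J = jacobian_fd(lambda th: toy_features(toy_synth(th)), theta)
    return J.T @ J

def pnp_loss(theta_pred, theta_true, M, lam=0.0):
    """Quadratic form <d | M + lam * I | d>, where d = theta_pred - theta_true."""
    d = np.asarray(theta_pred, dtype=float) - np.asarray(theta_true, dtype=float)
    return float(d @ (M + lam * np.eye(d.size)) @ d)

# The metric depends only on the ground-truth parameters, so it can be
# computed once per training sample and cached before training starts.
theta = np.array([0.8, 0.05])
M = metric(theta)

# Compare the quadratic surrogate with the feature-space distance it approximates.
theta_pred = theta + np.array([1e-3, 1e-5])
surrogate = pnp_loss(theta_pred, theta, M)
direct = float(np.sum((toy_features(toy_synth(theta_pred))
                       - toy_features(toy_synth(theta))) ** 2))
print(surrogate, direct)
```

With `lam = 0` the loss is the purely perceptual quadratic form. As `lam` grows, the `lam * I` term dominates and the loss approaches a scaled squared error in parameter space, which is one way to read the transition between parameter-space loss and perceptual loss.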

Abstract

Perceptual sound matching (PSM) aims to find the input parameters to a synthesizer so as to best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is perceptual-neural-physical loss (PNP), which aims at addressing a tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with joint time-frequency scattering transform (JTFS) as auditory feature, while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has greater influence on PSM performance than any other design choice.
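The linearization mentioned in the abstract can be written out as a short derivation. This is a sketch under two assumptions that this page does not state explicitly: the perceptual distance is a squared Euclidean norm in feature space, and $\mathbf{J}(\boldsymbol{\theta})$ (notation introduced here) denotes the Jacobian of $(\boldsymbol{\Phi} \circ \boldsymbol{g})$ at $\boldsymbol{\theta}$. For a network estimate $\tilde{\boldsymbol{\theta}}$ close to a training sample $\boldsymbol{\theta}$, a first-order Taylor expansion gives

$$
(\boldsymbol{\Phi} \circ \boldsymbol{g})(\tilde{\boldsymbol{\theta}}) \approx (\boldsymbol{\Phi} \circ \boldsymbol{g})(\boldsymbol{\theta}) + \mathbf{J}(\boldsymbol{\theta})\,(\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}),
$$

so that

$$
\big\Vert (\boldsymbol{\Phi} \circ \boldsymbol{g})(\tilde{\boldsymbol{\theta}}) - (\boldsymbol{\Phi} \circ \boldsymbol{g})(\boldsymbol{\theta}) \big\Vert_2^2
\approx (\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta})^\top \mathbf{J}(\boldsymbol{\theta})^\top \mathbf{J}(\boldsymbol{\theta})\,(\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta})
= \langle\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}\vert\mathbf{M}(\boldsymbol{\theta})\vert\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}\rangle,
\qquad \mathbf{M}(\boldsymbol{\theta}) = \mathbf{J}(\boldsymbol{\theta})^\top \mathbf{J}(\boldsymbol{\theta}).
$$

Under these assumptions, $\mathbf{M}(\boldsymbol{\theta})$ depends only on the ground-truth parameters and not on the network output, which is why it can be computed once per training sample, independently across samples, before gradient descent begins.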

Paper Structure (29 sections, 27 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Motivation
Problem statement
Key idea
Contributions
Related Work
Differentiable Synthesizers
Perceptually motivated time-frequency representations
P-loss and Differentiable Spectral Loss
Methods
Taylor expansion
Riemannian geometry
Levenberg-Marquardt algorithm
Application
Joint time-frequency scattering transform ($\boldsymbol{\Phi}$)
...and 14 more sections

Figures (3)

Figure 1: Graphical outline of the proposed method. Given a known synthesizer $\boldsymbol{g}$ and feature map $\boldsymbol{\Phi}$, we train a neural network $\boldsymbol{f}_{\mathbf{W}}$ to estimate $\tilde{\boldsymbol{\theta}}$ and minimize the "perceptual-neural-physical" (PNP) quadratic form $\langle\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}\vert\mathbf{M}(\boldsymbol{\theta})\vert\tilde{\boldsymbol{\theta}} - \boldsymbol{\theta}\rangle$, where $\mathbf{M}$ is the Riemannian metric associated with $(\boldsymbol{\Phi} \circ \boldsymbol{g})$. Transformations in solid (resp. dashed) lines can (resp. cannot) be cached during training.
Figure 2: Ablation study of the proposed model. We show how the JTFS metric (left column) and the MSS metric (right column) deteriorate when individual components of the algorithm are removed. For both metrics, lower values indicate better perceptual accuracy. The bottom row is the proposed model detailed in Algorithm 1. Each row above it shows the extent of degradation when one mechanism is disabled. The use of JTFS as opposed to MSS is essential to model performance. Each model is trained five times from different random seeds (red diamonds); black vertical lines indicate the average metric across trials.
Figure 3: Learning dynamics of P-loss. Pretraining with P-loss is followed by finetuning with either JTFS-based PNP loss or MSS loss. The curves show the progression of three measures (average P-loss, MSS, and JTFS distance on the validation set) over 70 training epochs. P-loss converges rapidly yet reaches a plateau on all metrics. JTFS-based PNP finetuning keeps improving the two perceptual metrics without deteriorating the P-loss metric. In contrast, MSS finetuning fails to converge to an optimal solution after a spike in metrics caused by a re-initialized learning rate.
