Audio signal interpolation using optimal transportation of spectrograms

David Valdivia, Marien Renaud, Elsa Cazelles, Cédric Févotte

TL;DR

We address audio interpolation by treating the time–frequency (TF) energy of the source and target signals as discrete distributions μ^s and μ^t on a TF grid and computing their Wasserstein barycenter μ^α, which serves as the interpolant. The approach comprises a baseline TF optimal transport (OT) formulation with cost matrix C and a refined unbalanced OT formulation whose structured cost matrix limits time displacements through a parameter p, with mass variation controlled by β. The barycenter mass is assigned to a native TF grid, and the time-domain signal is then reconstructed through TF reassignment and Griffin–Lim phase recovery. Experiments on synthetic musical notes and environmental sounds indicate that unbalanced OT with time-limited transport preserves dynamics and texture, offering a scalable, perceptually faithful alternative to framewise OT and NMF-based schemes.
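To make the grid-assignment step concrete, here is a minimal sketch of how mass from a transport plan could be moved to interpolated TF positions and snapped back onto the grid. It assumes the plan is already available from some OT solver, uses standard displacement interpolation, and rounds to the nearest bin; the function name `interpolate_plan`, the rounding rule, and the toy inputs are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def interpolate_plan(plan, src_pts, tgt_pts, alpha, grid_shape):
    """Place each unit of transported mass at an interpolated grid position.

    plan:       (n, m) nonnegative array; plan[i, j] is the mass moved
                from source bin i to target bin j (assumed precomputed).
    src_pts:    (n, 2) integer (freq, time) indices of the source bins.
    tgt_pts:    (m, 2) integer (freq, time) indices of the target bins.
    alpha:      interpolation weight in [0, 1] (0 = source, 1 = target).
    grid_shape: (n_freq, n_time) shape of the output TF grid.
    """
    out = np.zeros(grid_shape)
    i, j = np.nonzero(plan)
    # Displacement interpolation between the paired bin positions.
    pos = (1.0 - alpha) * src_pts[i] + alpha * tgt_pts[j]
    # Assumption: snap to the nearest grid bin (ties follow np.rint).
    idx = np.rint(pos).astype(int)
    # Accumulate mass; several pairs may land in the same bin.
    np.add.at(out, (idx[:, 0], idx[:, 1]), plan[i, j])
    return out

# Toy example: one unit of mass travels from bin (0, 0) to bin (4, 2).
src_pts = np.array([[0, 0]])
tgt_pts = np.array([[4, 2]])
plan = np.array([[1.0]])
mid = interpolate_plan(plan, src_pts, tgt_pts, alpha=0.5, grid_shape=(5, 3))
```

In this toy case all the mass ends up in the bin halfway between the two positions, and the total mass of the plan is preserved. Phase reconstruction and inversion to a time-domain signal are not covered by the sketch.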

Abstract

We present a novel approach for generating an artificial audio signal that interpolates between given source and target sounds. Our approach relies on the computation of Wasserstein barycenters of the source and target spectrograms, followed by phase reconstruction and inversion. In contrast with previous works, our method considers the spectrograms globally and does not operate on a temporal frame-to-frame basis. Another contribution is to endow the transportation cost matrix with a specific structure that prohibits remote displacements of energy along the time axis, and for which optimal transport is made possible by leveraging the unbalanced transport framework. The proposed cost matrix makes sense from an audio perspective and also reduces the computational load. Results with synthetic musical notes and real environmental sounds illustrate the potential of the approach.
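One way to picture a cost matrix that prohibits remote time displacements is sketched below. The abstract does not give the exact form of the cost, so the sketch assumes a squared Euclidean ground cost between TF bin indices and an infinite cost whenever two bins are more than `p` frames apart; the function name `time_limited_cost` and both choices are assumptions made for illustration.

```python
import numpy as np

def time_limited_cost(n_freq, n_time, p):
    """Pairwise cost between all bins of an (n_freq, n_time) TF grid.

    Assumption: squared Euclidean distance between (freq, time) indices,
    set to +inf when the time displacement exceeds p frames.
    Bins are flattened in row-major order: index = f * n_time + t.
    """
    f, t = np.meshgrid(np.arange(n_freq), np.arange(n_time), indexing="ij")
    pts = np.stack([f.ravel(), t.ravel()], axis=1).astype(float)
    diff = pts[:, None, :] - pts[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    # Forbid transport between bins that are too far apart in time.
    too_far = np.abs(diff[..., 1]) > p
    return np.where(too_far, np.inf, cost)

# Toy grid with 2 frequency bins and 3 time frames, time limit of 1 frame.
C_bar = time_limited_cost(2, 3, p=1)
# With p = +inf no entry is forbidden and the plain cost is recovered.
C = time_limited_cost(2, 3, p=np.inf)
```

Under these assumptions, only the finite entries of the matrix correspond to admissible transport, so a solver that exploits this sparsity has fewer entries to handle as `p` shrinks. Because forbidden moves can make an exact mass-preserving coupling infeasible, such a cost is naturally paired with an unbalanced formulation, as the abstract indicates.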

Paper Structure

This paper contains 16 sections, 8 equations, 2 figures.

Figures (2)

  • Figure 1: Interpolation between two versions of the same sequence of notes C3-G3 played on a piano and a guitar, using $\alpha=0.5$.
  • Figure 2: Interpolation between a cicada's chirp and flowing water, using $\alpha=0.5$. Various values of the time-limiting parameter $p$ are considered. In subplot (f) we use $p=+\infty$ as a shorthand for using $\bar{\mathbf{C}} = \mathbf{C}$ (the time limit is lifted).

Theorems & Definitions (2)

  • Definition 1
  • Definition 2