Gaussian Flow Bridges for Audio Domain Transfer with Unpaired Data

Eloi Moliner; Sebastian Braun; Hannes Gamper

TL;DR

The paper addresses unsupervised audio domain transfer by introducing Gaussian Flow Bridges (GFBs), which chain two deterministic flows to transport samples between an input distribution and a conditioned target, guided by a continuous vector $\mathbf{c}$. Grounded in Continuous Normalizing Flows and Conditional Flow Matching, the approach leverages an encoder to map data to a Gaussian latent and a conditional decoder to generate conditioned outputs, aiming for near-optimal transport paths. A key contribution is the chunk-based minibatch optimal transport strategy, which reduces trajectory curvature and helps preserve speech content when training on high-dimensional waveform data. Experimental results on reverberation and distortion tasks demonstrate competitive performance and generalization to unseen speakers and acoustic conditions, while highlighting areas for improvement in content fidelity and artifact suppression.
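
The encode-then-decode idea can be illustrated with a one-dimensional toy, given here as a minimal sketch rather than the paper's implementation. It assumes each domain is a Gaussian $\mathcal{N}(\mu, \sigma^2)$, that the flow runs from data at $\tau=0$ to the standard Gaussian at $\tau=1$ along straight paths, and that the condition $\mathbf{c}$ is simply the target's `(mu, sigma)`. The closed-form `velocity` stands in for the learned network, and all function names are hypothetical.

```python
def velocity(x, tau, mu, sigma):
    """Velocity of the straight-line flow from N(mu, sigma^2) at tau=0 to N(0, 1) at tau=1.

    Assumption: a closed-form toy field replaces the trained neural network.
    """
    # Recover the data endpoint x0 of the straight trajectory passing through (x, tau).
    x0 = (x + tau * mu / sigma) / (1.0 - tau + tau / sigma)
    z = (x0 - mu) / sigma  # matching Gaussian endpoint
    return z - x0          # constant along a straight path


def integrate(x, mu, sigma, steps, forward):
    """Euler integration of the flow ODE, forward (0 -> 1) or backward (1 -> 0)."""
    dt = 1.0 / steps
    for k in range(steps):
        tau = k * dt if forward else 1.0 - k * dt
        v = velocity(x, tau, mu, sigma)
        x = x + dt * v if forward else x - dt * v
    return x


def bridge(x, source, target, steps=20):
    """Encode with the source flow, then decode with the target-conditioned flow."""
    z = integrate(x, *source, steps, forward=True)       # data -> Gaussian latent
    return integrate(z, *target, steps, forward=False)   # latent -> conditioned output


# Source domain N(2, 0.5^2), target domain N(-1, 2^2).
y = bridge(2.5, source=(2.0, 0.5), target=(-1.0, 2.0))
```

In this toy the bridge reduces to the affine map $y = \mu_t + \sigma_t (x - \mu_s)/\sigma_s$, which preserves where a sample sits within its distribution while changing the domain statistics. Because the paths are straight, a coarse Euler solver is already accurate here.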

Abstract

Audio domain transfer is the process of modifying audio signals to match the characteristics of a different domain while retaining the original content. This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem. The presented framework addresses the transport problem across different distributions of audio signals by chaining two deterministic probability flows. It allows the properties of the target distribution to be manipulated through a continuous control variable, which defines a certain aspect of the target domain. Notably, this approach does not rely on paired examples for training. To address the identified challenge of keeping the speech content consistent, we propose a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise. Comparing our unsupervised method with established baselines, we find competitive performance on reverberation and distortion manipulation tasks. Despite the limitations encountered, the results obtained in this study underscore the potential for further exploration.
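
The chunk-based coupling can be pictured with a small pure-Python sketch. It shows one plausible reading, not the authors' procedure: each signal in the minibatch is split into chunks of `n_c` samples, all chunks are pooled, and the noise chunks are reordered to minimise the total squared distance to the data chunks. Pooling across the batch, the squared-Euclidean cost, and every function name are assumptions. The brute-force search is only viable at toy sizes; a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice in practice.

```python
import itertools
import random


def split_chunks(signal, n_c):
    """Split a signal (list of floats) into consecutive chunks of n_c samples."""
    return [signal[i:i + n_c] for i in range(0, len(signal), n_c)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length chunks."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def ot_assignment(xs, zs):
    """Exact minimum-cost one-to-one assignment by brute force (toy sizes only)."""
    n = len(xs)
    cost = [[sq_dist(x, z) for z in zs] for x in xs]
    return list(min(itertools.permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))


def chunk_ot_couple(data, noise, n_c):
    """Reorder noise chunks so each data chunk is paired with its assigned noise chunk.

    Assumes all signals share one length that is divisible by n_c.
    """
    data_chunks = [c for s in data for c in split_chunks(s, n_c)]
    noise_chunks = [c for s in noise for c in split_chunks(s, n_c)]
    perm = ot_assignment(data_chunks, noise_chunks)
    matched = [noise_chunks[j] for j in perm]
    per_signal = len(data[0]) // n_c
    # Reassemble the matched chunks into noise signals aligned with the data.
    return [sum(matched[i * per_signal:(i + 1) * per_signal], [])
            for i in range(len(data))]


random.seed(0)
data = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
noise = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
coupled = chunk_ot_couple(data, noise, n_c=2)
```

The reordered noise contains exactly the same values as the original draw, so it is still a valid noise sample for training; only the pairing with the data changes, which shortens the paths the flow has to learn.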

Paper Structure (13 sections, 3 equations, 5 figures, 1 algorithm)

Introduction
Background
Continuous Normalizing Flows
Conditional Flow Matching
Methods
Gaussian Flow Bridges
Chunk-based minibatch optimal transport couplings
Experiments
Experimental setup
Coupling configurations and trajectory curvature analysis
Speech reverberation evaluation
Declipping evaluation
Conclusion

Figures (5)

Figure 1: (Top) Illustration of a GFB in one-dimensional space. (Middle) A sequential display of spectrograms, showcasing the stages of audio signal transformation. (Bottom) Geometrical interpretation highlighting the mapping of data points through encoding and decoding within a Gaussian space.
Figure 2: Averaged trajectory curvature with respect to time $\tau$ when different coupling strategies are used. The shaded area represents the 25% and 75% percentiles.
Figure 3: Scatter plots illustrating the trade-offs between SR-CS and WER versus T$_{60}$ and C$_{50}$ errors for models conditioned on specific acoustic features. Points represent aggregated test set results, highlighting the effects of chunk length (N$_\text{c}$) and CFG scale ($\gamma$).
Figure 4: Objective evaluation on speech dereverberation.
Figure 5: Objective evaluation on speech declipping.
