Zero-Shot Mono-to-Binaural Speech Synthesis

Alon Levkovitch, Julian Salazar, Soroosh Mariooryad, RJ Skerry-Ryan, Nadav Bar, Bastiaan Kleijn, Eliya Nachmani

TL;DR

ZeroBAS tackles mono-to-binaural speech synthesis without binaural training data by combining a parameter-free geometric time warp with inverse-square amplitude scaling and iterative refinement through a pretrained denoising vocoder. The three-stage pipeline generates binaural outputs conditioned on source and listener geometry. It reaches perceptual parity with supervised methods on in-distribution data and outperforms them on out-of-distribution data, as measured on the newly constructed TUT Mono-to-Binaural dataset. The approach shows that pretrained generative audio models and zero-shot learning can produce robust spatial audio across room conditions, without explicit room or head-related transfer function modeling. The work also introduces a principled dataset construction method and shows that zero-shot methods can outperform supervised approaches when generalization is required, with practical implications for AR/VR audio realism and deployment simplicity.

Abstract

We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
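
The three stages described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the head-centered coordinate frame, the `ear_offset` value, the speed of sound, the linear interpolation, the normalization of the gains, and the names `geometric_warp`, `amplitude_scale`, and `binauralize` are all hypothetical. The gain exponent defaults to 2 only because the TL;DR describes the scaling as inverse-square; the exact formula is not given in this section. The `denoiser` argument is a placeholder for the pretrained denoising vocoder.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def geometric_warp(mono, distance, sample_rate):
    """Delay `mono` by the propagation time over `distance` (meters).

    `distance` may be a scalar (static source) or one value per sample.
    """
    n = np.arange(len(mono), dtype=np.float64)
    delay = np.asarray(distance, dtype=np.float64) / SPEED_OF_SOUND * sample_rate
    # Read the input at fractional positions; samples before t=0 are silence.
    return np.interp(n - delay, n, mono, left=0.0, right=0.0)


def amplitude_scale(left, right, d_left, d_right, exponent=2.0):
    """Scale each channel by 1 / distance**exponent (assumed form).

    Gains are normalized so the nearer ear has unit gain (an assumption).
    """
    g_left = 1.0 / np.maximum(d_left, 1e-3) ** exponent
    g_right = 1.0 / np.maximum(d_right, 1e-3) ** exponent
    norm = np.maximum(g_left, g_right)
    return left * (g_left / norm), right * (g_right / norm)


def binauralize(mono, source_pos, sample_rate,
                ear_offset=0.0875, denoiser=None, n_refine=3):
    """Return a (2, T) array [left, right] for a source at `source_pos`.

    `source_pos` is (3,) or (T, 3) in a hypothetical head-centered frame
    whose x axis points toward the listener's right ear.
    """
    mono = np.asarray(mono, dtype=np.float64)
    source_pos = np.asarray(source_pos, dtype=np.float64)
    left_ear = np.array([-ear_offset, 0.0, 0.0])
    right_ear = np.array([ear_offset, 0.0, 0.0])
    d_left = np.linalg.norm(source_pos - left_ear, axis=-1)
    d_right = np.linalg.norm(source_pos - right_ear, axis=-1)

    # Stage 1: geometric time warp, one delay per ear.
    left = geometric_warp(mono, d_left, sample_rate)
    right = geometric_warp(mono, d_right, sample_rate)
    # Stage 2: distance-based amplitude scaling.
    left, right = amplitude_scale(left, right, d_left, d_right)
    # Stage 3: iterative refinement, each channel denoised independently.
    if denoiser is not None:
        for _ in range(n_refine):
            left, right = denoiser(left), denoiser(right)
    return np.stack([left, right])
```

With a source placed to the listener's right, the right channel receives the signal earlier and at a higher level than the left, which is the interaural time and level cue the first two stages are meant to produce. Passing a real monaural denoising model as `denoiser` would complete the pipeline; here it is left as a plain callable.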

Paper Structure

This paper contains 14 sections, 4 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Our proposed ZeroBAS method. The mono waveform is binauralized with geometric time warping conditioned on the speaker's position, then the two channels' amplitudes are scaled. Each channel is then denoised 3 times by a monaural denoising vocoder.
  • Figure 2: MUSHRA results for (a) the BSD and (b) the TMB.