Zero-Shot Mono-to-Binaural Speech Synthesis
Alon Levkovitch, Julian Salazar, Soroosh Mariooryad, RJ Skerry-Ryan, Nadav Bar, Bastiaan Kleijn, Eliya Nachmani
TL;DR
ZeroBAS tackles mono-to-binaural speech synthesis without binaural training data by combining a parameter-free geometric time warp with inverse-square amplitude scaling and iterative refinement through a pretrained denoising vocoder. The three-stage pipeline generates binaural outputs conditioned on source and listener geometry. It achieves perceptual parity with supervised methods on in-distribution data and outperforms them on out-of-distribution conditions, as measured on the newly constructed TUT Mono-to-Binaural dataset. The approach demonstrates the potential of pretrained generative audio models and zero-shot learning to produce robust spatial audio across room conditions, without explicit room or head-related transfer function modeling. The work also introduces a principled dataset construction method and shows that zero-shot methods can outperform supervised approaches when generalization is required, with practical implications for AR/VR audio realism and deployment simplicity.
Abstract
We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art mono-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
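The geometric time warping and amplitude scaling described above can be illustrated with a minimal sketch. It is not the authors' implementation; it makes several assumptions not stated in the text: the listener's head is centered at the origin with the x axis pointing to the right, the ears sit at a hypothetical offset of 8.75 cm on either side, the speed of sound is 343 m/s, the warp uses linear interpolation, and the scaling is a plain inverse-square falloff with a small distance floor. The function names `warp_and_scale` and `initial_binaural` are illustrative. The vocoder refinement stage is omitted because it depends on a specific pretrained model.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def warp_and_scale(mono, source_pos, ear_pos, sample_rate, min_dist=0.1):
    """Delay and attenuate `mono` for one ear using geometry only."""
    mono = np.asarray(mono, dtype=np.float64)
    n = mono.shape[0]
    # Accept a static (3,) position or a per-sample (n, 3) trajectory.
    src = np.broadcast_to(np.asarray(source_pos, dtype=np.float64), (n, 3))
    dist = np.linalg.norm(src - np.asarray(ear_pos, dtype=np.float64), axis=1)
    dist = np.maximum(dist, min_dist)  # assumed floor to avoid huge gains
    # Time warp: output sample t reads the input at t minus the
    # propagation delay (in samples), with linear interpolation.
    t = np.arange(n, dtype=np.float64)
    read_idx = t - dist / SPEED_OF_SOUND * sample_rate
    warped = np.interp(read_idx, t, mono, left=0.0, right=0.0)
    # Amplitude scaling: inverse-square falloff with distance.
    return warped / dist**2


def initial_binaural(mono, source_pos, sample_rate, ear_offset=0.0875):
    """Return a (2, n) array: row 0 is the left ear, row 1 the right ear."""
    left_ear = np.array([-ear_offset, 0.0, 0.0])
    right_ear = np.array([ear_offset, 0.0, 0.0])
    left = warp_and_scale(mono, source_pos, left_ear, sample_rate)
    right = warp_and_scale(mono, source_pos, right_ear, sample_rate)
    return np.stack([left, right])


# Usage: an impulse emitted by a source one meter to the listener's right.
mono = np.zeros(1000)
mono[100] = 1.0
binaural = initial_binaural(mono, [1.0, 0.0, 0.0], sample_rate=48000)
```

With the source on the right, the right channel receives the impulse earlier and at a higher level than the left, which gives the interaural time and level differences that the refinement stage would then start from.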
