BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Jiawei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li, Tao Qin, Sheng Zhao, Tie-Yan Liu

TL;DR

BinauralGrad synthesizes binaural audio from mono input with a two-stage diffusion framework that separately models the information common to both ears and the ear-specific differences. The first stage generates the shared component $\bar{y}$, the average of the left and right channels, conditioned on the mono signal; the second stage uses a two-channel diffusion model to synthesize the left and right channels from the first-stage output. On a benchmark binaural dataset, BinauralGrad outperforms existing baselines on both objective and subjective metrics (Wave L2: 0.128 vs. 0.157; MOS: 3.80 vs. 3.61), supporting the common/specific decomposition. The work targets immersive audio for VR/AR, and code and audio samples are available online.

Abstract

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which are difficult to simulate accurately with traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels and a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize the two parts respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio; based on this output, the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
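The common/specific decomposition described in the abstract can be illustrated with a minimal sketch. It uses the definition from the Figure 1 caption, $\bar{y}=\textrm{mean}(y^l, y^r)$, for the common part. Representing the specific part as a per-channel residual is an assumption made here for illustration only; the function name `decompose` and the toy sample values are hypothetical, and plain Python lists are used to keep the sketch dependency-free.

```python
def decompose(y_left, y_right):
    """Split a binaural pair into a common part and per-channel residuals."""
    if len(y_left) != len(y_right):
        raise ValueError("channels must have the same length")
    # Common part: sample-wise mean of the two channels (y_bar in Figure 1).
    y_bar = [(l + r) / 2.0 for l, r in zip(y_left, y_right)]
    # ASSUMPTION: the "specific" part is illustrated as each channel's
    # residual relative to the common part; this is not the paper's model.
    res_left = [l - c for l, c in zip(y_left, y_bar)]
    res_right = [r - c for r, c in zip(y_right, y_bar)]
    return y_bar, res_left, res_right


# Toy four-sample signals (hypothetical values).
y_left = [0.5, -0.25, 0.0, 1.0]
y_right = [0.25, 0.25, -0.5, 0.0]
y_bar, res_left, res_right = decompose(y_left, y_right)

# Each channel is recovered exactly as common part plus residual.
rebuilt_left = [c + d for c, d in zip(y_bar, res_left)]
rebuilt_right = [c + d for c, d in zip(y_bar, res_right)]
```

Under this illustrative choice the two residuals are mirror images of each other, which shows why the averaged signal carries everything the two ears share while the residuals carry only the inter-ear difference.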

Paper Structure

This paper contains 24 sections, 10 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: An illustration of the proposed two-stage framework. $x$ indicates the mono audio emitted by the source object, $y=(y^l, y^r)$ indicates the binaural audio received by the left and right ears of the receiver, and $\bar{y}=\textrm{mean}(y^l, y^r)$ is the mono audio calculated by averaging the two channels in $y$. The common stage models the factors affecting both the left and right channels, such as the distance between the source and the listener and the interaction of the audio with the room and head. The specific stage models the differences between the left and right ears, including the marginal distance difference and the difference between the impulse responses of the two ears.
  • Figure 2: The pipeline of the proposed framework during training. The two stages differ in the condition signal $c$, the parameter set $\theta$, and the use of one-channel and two-channel diffusion models in the common and specific stages respectively.
  • Figure 3: An illustration of the model architecture. "FC" indicates the fully connected layer, "Conv" indicates the $1 \times 1$ convolution layer, and "Dilated Conv" indicates the bidirectional dilated convolution layer. The input represents the corrupted data sample $z_t$ at the $t$-th diffusion step, while the position and the conditional audio are defined in the preliminaries section. We omit the activation functions for simplicity.
  • Figure 4: Case 1. Sub-figure (a) shows the waveforms of the mono audio, the ground-truth (GT) binaural audio, and the binaural audio synthesized by BinauralGrad and the baseline systems. Sub-figure (b) shows the prediction error between the synthesized audio and the GT audio.
  • Figure 5: Case 2. Sub-figure (a) shows the waveforms of the mono audio, the ground-truth (GT) binaural audio, and the binaural audio synthesized by BinauralGrad and the baseline systems. Sub-figure (b) shows the prediction error between the synthesized audio and the GT audio.
  • ...and 1 more figures
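
The captions for Figures 2 and 3 refer to one-channel and two-channel diffusion models and to a corrupted sample $z_t$ at the $t$-th diffusion step. As a rough sketch of what such a corruption could look like, the code below applies the standard DDPM forward process, $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, to a one-channel signal (the channel average $\bar{y}$) and to a two-channel signal $y$. The linear noise schedule, the 200-step length, the step index, the toy sine signals, and all function names are assumptions for illustration, not values or code taken from the paper.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """ASSUMPTION: a linear noise schedule, common in DDPM-style models."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_s) up to each step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def corrupt(z0, t, a_bars, rng):
    """Noise a list of channels z0 to diffusion step t; return (z_t, eps)."""
    scale = math.sqrt(a_bars[t])
    sigma = math.sqrt(1.0 - a_bars[t])
    eps = [[rng.gauss(0.0, 1.0) for _ in ch] for ch in z0]
    z_t = [[scale * s + sigma * e for s, e in zip(ch, ech)]
           for ch, ech in zip(z0, eps)]
    return z_t, eps


rng = random.Random(0)
a_bars = alpha_bars(linear_betas(200))  # hypothetical number of steps
t = 50                                  # hypothetical diffusion step

# Toy binaural signal y = (y^l, y^r) and its channel average y_bar.
y = [[math.sin(0.1 * n) for n in range(64)],
     [math.sin(0.1 * n + 0.3) for n in range(64)]]
y_bar = [[(l + r) / 2.0 for l, r in zip(y[0], y[1])]]

z_common, eps_common = corrupt(y_bar, t, a_bars, rng)  # one channel
z_specific, eps_specific = corrupt(y, t, a_bars, rng)  # two channels
```

The only structural difference between the two calls is the number of channels in the input. In a trained model, a network would additionally receive the stage's condition signal $c$ and learn to predict the added noise; that part is omitted here.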