BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

Susan Liang, Dejan Markovic, Israel D. Gebru, Steven Krenn, Todd Keebler, Jacob Sandakly, Frank Yu, Samuel Hassel, Chenliang Xu, Alexander Richard

TL;DR

This work addresses high-fidelity binaural speech synthesis with streaming inference by reframing binaural rendering as a generative task rather than a regression task. It introduces BinauralFlow, a conditional flow matching framework conditioned on the mono input and the speaker and listener poses, paired with a causal U-Net backbone and a continuous inference pipeline. The pipeline combines streaming STFT/ISTFT, a buffer bank, a midpoint solver, and an early skip schedule to enable low-latency, continuous generation and real-time operation. The authors report better quantitative results than prior state-of-the-art methods and a 42% ABX confusion rate in a perceptual study, which suggests that listeners found the output nearly indistinguishable from real recordings. According to the summary, the combination of conditional flow matching, a causality-aware architecture, and streaming inference generalizes well and is relevant to immersive spatial audio in VR/AR, gaming, and interactive media.

Abstract

Binaural rendering aims to synthesize binaural audio that mimics natural hearing from a mono audio signal and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a $42\%$ confusion rate.
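
The abstract names a midpoint solver but gives no details of it here. As a rough illustration only, the sketch below assumes this means the standard explicit midpoint rule applied to the flow ODE $\mathrm{d}\mathbf{z}/\mathrm{d}t = v(\mathbf{z}, t)$ on $t \in [0, 1]$. The names `midpoint_solve`, `euler_solve`, and `toy_field` are hypothetical, and `toy_field` is a stand-in for the trained network, not the authors' model.

```python
import numpy as np

def midpoint_solve(vector_field, z0, num_steps):
    """Integrate dz/dt = vector_field(z, t) from t=0 to t=1 (explicit midpoint rule)."""
    z = np.asarray(z0, dtype=float)
    h = 1.0 / num_steps
    nfe = 0  # number of function evaluations
    for n in range(num_steps):
        t = n * h
        k1 = vector_field(z, t)                # slope at the start of the step
        z_half = z + 0.5 * h * k1              # provisional half step
        k2 = vector_field(z_half, t + 0.5 * h) # slope at the midpoint
        z = z + h * k2                         # full step using the midpoint slope
        nfe += 2
    return z, nfe

def euler_solve(vector_field, z0, num_steps):
    """First-order Euler baseline: one function evaluation per step."""
    z = np.asarray(z0, dtype=float)
    h = 1.0 / num_steps
    nfe = 0
    for n in range(num_steps):
        z = z + h * vector_field(z, n * h)
        nfe += 1
    return z, nfe

def toy_field(z, t):
    """Assumed toy vector field (exact solution z0 * exp(-t)); NOT a trained model."""
    return -z

if __name__ == "__main__":
    z0 = np.array([1.0])
    z_mid, nfe_mid = midpoint_solve(toy_field, z0, num_steps=8)
    z_eul, nfe_eul = euler_solve(toy_field, z0, num_steps=16)
    exact = np.exp(-1.0)
    print("midpoint error:", abs(z_mid[0] - exact), "NFE:", nfe_mid)
    print("euler error:   ", abs(z_eul[0] - exact), "NFE:", nfe_eul)
```

The midpoint rule spends two evaluations per step, so 8 midpoint steps and 16 Euler steps cost the same number of function evaluations. On this toy problem the second-order midpoint rule is the more accurate of the two at that equal budget, which is the usual motivation for preferring it when each evaluation is a network forward pass.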

Paper Structure

This paper contains 28 sections, 8 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: Overview of our BinauralFlow framework. (a) shows the causal U-Net architecture. Our causal U-Net takes as input the flow $\phi_t(\mathbf{z})$ as well as four conditions $t$, $p_\mathrm{rx}$, $p_\mathrm{tx}$, and $\mathbf{x}$, and outputs a predicted vector field. The U-Net consists of several Causal 2D Conv Blocks in the contracting and expanding parts. (b) displays the Causal 2D Conv Block. We design fully causal convolution, down/up-sampling, and normalization layers to ensure temporal causality.
  • Figure 2: Continuous inference pipeline. Starting with a mono audio chunk (top left, black solid-line box), we compute its spectrogram via streaming STFT, add noise, and duplicate the channel to form the noisy spectrogram $\phi_0(\mathbf{z})$. The trained model progressively removes the noise with a buffer bank. Finally, streaming ISTFT converts the predicted binaural spectrogram $\phi_1(\mathbf{z})$ into binaural audio. When the next audio chunk appears (black dashed-line box), we repeat the process and synthesize seamlessly continuous binaural speech.
  • Figure 3: The early skip time schedule. The use of an early skip strategy effectively reduces the inference steps and retains the generation performance.
  • Figure 4: Performance with respect to the number of function evaluations (NFE). We evaluate all generative models using the same NFE for a fair comparison.
  • Figure 5: Qualitative comparison between different baselines. We display waveforms of rendered spatial audio.
  • ...and 7 more figures
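
The captions for Figures 1 and 2 describe causal layers and a buffer bank that let the model process audio chunk by chunk. The sketch below is a simplified 1D stand-in, not the authors' 2D blocks: it assumes that a buffer caches the last `k - 1` input samples of a causal convolution so that chunk-wise output matches offline output exactly. All function names (`causal_conv_chunk`, `stream`, `offline`) are hypothetical.

```python
import numpy as np

def causal_conv_chunk(chunk, kernel, history):
    """Causal convolution of one non-empty chunk: y[n] = sum_j kernel[j] * x[n - j].

    `history` holds the previous len(kernel) - 1 input samples (the buffer).
    Returns the chunk's output and the updated buffer.
    """
    k = len(kernel)
    padded = np.concatenate([history, chunk])
    out = np.convolve(padded, kernel, mode="valid")  # one output per chunk sample
    new_history = padded[len(padded) - (k - 1):]     # keep the last k - 1 inputs
    return out, new_history

def stream(signal, kernel, chunk_size):
    """Process a non-empty signal chunk by chunk, carrying the buffer across chunks."""
    history = np.zeros(len(kernel) - 1)  # silence before the stream starts
    outputs = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        out, history = causal_conv_chunk(chunk, kernel, history)
        outputs.append(out)
    return np.concatenate(outputs)

def offline(signal, kernel):
    """Reference: causal convolution of the whole signal at once."""
    return np.convolve(signal, kernel, mode="full")[:len(signal)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(100)
    kernel = rng.standard_normal(5)
    print(np.allclose(stream(signal, kernel, 16), offline(signal, kernel)))
```

Because each output sample depends only on current and past inputs, the chunk boundaries introduce no discontinuity as long as the buffer is carried over. Without the buffer, every chunk would start from zero padding and the output would glitch at each boundary.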