SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

Mingyeong Song, Seoyeon Ko, Junhyug Noh

Abstract

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.

Paper Structure

This paper contains 12 sections, 15 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Overview. (a, b) SIREN takes mono STFT $S_M$ and video $V$; a ViT with dual-head attention produces a shared map $v$ and L/R features that FiLM-condition an audio U-Net to predict $\hat{S}_D$ and $(\hat{S}_L,\hat{S}_R)$. (c) At test time, two-stage confidence-weighted refinement fuses crop-wise and overlap-wise waveforms (post-iSTFT) using mono and interaural-phase scores.
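The confidence-weighted refinement in panel (c) can be illustrated with a minimal sketch: overlapping per-window waveforms are blended by weighted overlap-add, where each window's weight is its confidence score (in the paper, derived from mono-reconstruction and interaural-phase consistency; here the scores are simply given as inputs). The function name and signature below are hypothetical, not from the paper.

```python
import numpy as np

def confidence_weighted_fusion(windows, scores, hop, total_len):
    """Fuse overlapping waveform windows via confidence-weighted overlap-add.

    windows:   list of 1-D float arrays (post-iSTFT window waveforms)
    scores:    list of non-negative confidence scores, one per window
               (hypothetical stand-in for the paper's mono/phase scores)
    hop:       hop size in samples between consecutive windows
    total_len: length of the fused output signal in samples
    """
    out = np.zeros(total_len)
    weight = np.zeros(total_len)
    for i, (w, s) in enumerate(zip(windows, scores)):
        start = i * hop
        end = start + len(w)
        out[start:end] += s * w       # accumulate confidence-scaled samples
        weight[start:end] += s        # accumulate per-sample total weight
    # Normalize; the epsilon guards samples covered by no window.
    return out / np.maximum(weight, 1e-8)

# Example: two 4-sample windows with hop 2; the overlap region is the
# confidence-weighted average of the two windows.
fused = confidence_weighted_fusion(
    [np.ones(4), 2.0 * np.ones(4)], [1.0, 1.0], hop=2, total_len=6)
# → [1.0, 1.0, 1.5, 1.5, 2.0, 2.0]
```

In SIREN this fusion is applied twice (crop-wise, then overlap-wise), so low-confidence windows contribute less and crosstalk from poorly grounded crops is suppressed.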