SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

Mingyeong Song, Seoyeon Ko, Junhyug Noh

Abstract

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.

Paper Structure

This paper contains 12 sections, 15 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Overview. (a, b) SIREN takes mono STFT $S_M$ and video $V$; a ViT with dual-head attention produces a shared map $v$ and L/R features that FiLM-condition an audio U-Net to predict $\hat{S}_D$ and $(\hat{S}_L,\hat{S}_R)$. (c) At test time, two-stage confidence-weighted refinement fuses crop-wise and overlap-wise waveforms (post-iSTFT) using mono and interaural-phase scores.
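The confidence-weighted refinement in panel (c) can be illustrated with a minimal sketch: overlapping per-window waveforms are blended by weighted overlap-add, where each window's weight is its confidence score (in the paper, derived from mono-reconstruction and interaural-phase consistency; here the scores are simply given as inputs). The function name and signature below are hypothetical, not from the paper.

```python
import numpy as np

def confidence_weighted_fusion(windows, scores, hop, total_len):
    """Fuse overlapping waveform windows via confidence-weighted overlap-add.

    windows:   list of 1-D float arrays (post-iSTFT window waveforms)
    scores:    list of non-negative confidence scores, one per window
               (hypothetical stand-in for the paper's mono/phase scores)
    hop:       hop size in samples between consecutive windows
    total_len: length of the fused output signal in samples
    """
    out = np.zeros(total_len)
    weight = np.zeros(total_len)
    for i, (w, s) in enumerate(zip(windows, scores)):
        start = i * hop
        end = start + len(w)
        out[start:end] += s * w       # accumulate confidence-scaled samples
        weight[start:end] += s        # accumulate per-sample total weight
    # Normalize; the epsilon guards samples covered by no window.
    return out / np.maximum(weight, 1e-8)

# Example: two 4-sample windows with hop 2; the overlap region is the
# confidence-weighted average of the two windows.
fused = confidence_weighted_fusion(
    [np.ones(4), 2.0 * np.ones(4)], [1.0, 1.0], hop=2, total_len=6)
# → [1.0, 1.0, 1.5, 1.5, 2.0, 2.0]
```

In SIREN this fusion is applied twice (crop-wise, then overlap-wise), so low-confidence windows contribute less and crosstalk from poorly grounded crops is suppressed.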