Lightweight Implicit Neural Network for Binaural Audio Synthesis
Xikun Lu, Fang Liu, Weizhi Shi, Jinqiu Sang
TL;DR
This paper tackles edge-device binaural audio synthesis with Lite-INN, a lightweight two-stage framework: Time-Domain Warping (TDW) produces an initial binaural estimate, and an Implicit Binaural Corrector (IBC) refines its spectral details. The IBC expresses spectral corrections as a continuous function over spatio-temporal-frequency coordinates, using a compact MLP to predict a complex gain $G(t,f)$ that modulates STFT magnitudes and phases. Empirical results show that Lite-INN achieves perceptual quality comparable to WaveNet while using 72.7% fewer parameters and fewer MACs than the previous state-of-the-art method (NFS), enabling real-time, high-fidelity spatial audio on constrained devices. The results support implicit neural representations as a viable route to efficient, high-quality binaural synthesis.
Abstract
High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, which limits their use on edge devices. To address this, we propose the Lightweight Implicit Neural Network (Lite-INN), a novel two-stage framework. Lite-INN first generates an initial estimate using time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. The IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that Lite-INN achieves perceptual quality statistically comparable to that of the best-performing baseline model while significantly improving computational efficiency. Compared to the previous state-of-the-art method (NFS), Lite-INN achieves a 72.7% reduction in parameters and requires significantly fewer multiply-accumulate operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity spatial audio applications on edge devices.
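To make the refinement stage concrete, the following is a minimal NumPy sketch of how a coordinate-based MLP could produce a complex gain $G(t,f)$ and apply it to a spectrogram. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the input coordinates (normalised time, normalised frequency, and a 3-D source position), the log-magnitude-plus-phase parameterisation, and the names `init_mlp`, `mlp`, and `implicit_gain` are all hypothetical. The weights are random and untrained, and a random array stands in for the STFT of the warped signal.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a small fully connected network (untrained, illustration only)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with tanh hidden activations and a linear output layer."""
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b

def implicit_gain(params, n_frames, n_bins, source_pos):
    """Evaluate a complex gain G(t, f) on a frame-by-bin grid.

    Each query is the coordinate vector (t, f, x, y, z), with t and f
    normalised to [0, 1]. The network outputs a log-magnitude and a phase
    correction per coordinate (an assumed parameterisation).
    """
    t = np.linspace(0.0, 1.0, n_frames)
    f = np.linspace(0.0, 1.0, n_bins)
    tt, ff = np.meshgrid(t, f, indexing="ij")
    pos = np.broadcast_to(source_pos, (n_frames, n_bins, 3))
    coords = np.concatenate([tt[..., None], ff[..., None], pos], axis=-1)
    out = mlp(params, coords)  # shape (n_frames, n_bins, 2)
    log_mag, phase = out[..., 0], out[..., 1]
    return np.exp(log_mag + 1j * phase)

rng = np.random.default_rng(0)
params = init_mlp([5, 32, 32, 2], rng)  # hypothetical layer sizes

# Stand-in for the STFT of the stage-one (warped) signal for one ear.
n_frames, n_bins = 10, 129
X = rng.standard_normal((n_frames, n_bins)) + 1j * rng.standard_normal((n_frames, n_bins))

G = implicit_gain(params, n_frames, n_bins, source_pos=np.array([1.0, 0.5, 0.0]))
Y = G * X  # corrected spectrogram: magnitudes scaled by |G|, phases shifted by arg(G)
```

Because the gain is a function of continuous coordinates rather than a stored grid, the same small set of weights can be queried at any time-frequency resolution, which is one plausible reason such a corrector can stay compact.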
