Lightweight Implicit Neural Network for Binaural Audio Synthesis

Xikun Lu, Fang Liu, Weizhi Shi, Jinqiu Sang

TL;DR

This paper tackles edge-device binaural audio synthesis by introducing Lite-INN, a lightweight two-stage framework that combines Time-Domain Warping (TDW) with an Implicit Binaural Corrector (IBC) to refine spectral details. The IBC expresses spectral corrections as a continuous function over spatio-temporal-frequency coordinates, using a compact MLP to predict a complex gain $G(t,f)$ that modulates STFT magnitudes and phases. Empirical results show Lite-INN achieves perceptual quality comparable to WaveNet while dramatically reducing parameters and compute relative to state-of-the-art baselines, enabling real-time, high-fidelity spatial audio on constrained devices. The approach highlights the viability of implicit neural representations for efficient, high-quality binaural synthesis and offers practical benefits for edge-device audio applications.

Abstract

High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, which limits their use on edge devices. To address this, we propose the Lightweight Implicit Neural Network (Lite-INN), a novel two-stage framework. Lite-INN first generates an initial estimate using time-domain warping, which is then refined by an Implicit Binaural Corrector (IBC) module. The IBC is an implicit neural network that predicts amplitude and phase corrections directly, resulting in a highly compact model architecture. Experimental results show that Lite-INN achieves perceptual quality statistically comparable to that of the best-performing baseline model while significantly improving computational efficiency. Compared to the previous state-of-the-art method (NFS), Lite-INN achieves a 72.7% reduction in parameters and requires significantly fewer multiply-accumulate operations (MACs). This demonstrates that our approach effectively addresses the trade-off between synthesis quality and computational efficiency, providing a new solution for high-fidelity spatial audio applications on edge devices.
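
As a rough illustration of the refinement stage described above, the sketch below shows how a small MLP could map a coordinate vector to a log-amplitude correction and a phase correction, and how the resulting complex gain could modulate one STFT bin of the warped signal. This is not the authors' implementation: the coordinate layout, the layer sizes, the tanh activation, the random initialization, and the parameterization of the gain as exp(Δlog A + jΔφ) are all assumptions made for the example, and the function names are hypothetical.

```python
import cmath
import math
import random


def init_mlp(sizes, seed=0):
    """Build a tiny MLP as a list of (weights, biases) pairs (illustrative sizes)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def mlp_forward(layers, x):
    """Evaluate the MLP; tanh on hidden layers, linear output (assumed choices)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def complex_gain(layers, coords):
    """Assumed parameterization: G = exp(d_log_a + j * d_phi)."""
    d_log_a, d_phi = mlp_forward(layers, coords)
    return cmath.exp(complex(d_log_a, d_phi))


def correct_bin(layers, coarse_bin, coords):
    """Apply the predicted gain to one complex STFT bin of the warped signal."""
    return complex_gain(layers, coords) * coarse_bin


if __name__ == "__main__":
    # Hypothetical coordinates: three spatial values, then time and frequency.
    coords = [0.1, -0.2, 0.5, 0.3, 0.7]
    layers = init_mlp([5, 16, 2])
    print(correct_bin(layers, complex(0.5, -0.25), coords))
```

With this parameterization, multiplying by the gain scales the bin magnitude by exp(Δlog A) and rotates its phase by Δφ, so a network that outputs zeros leaves the warped estimate unchanged.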

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The proposed Lite-INN architecture, a two-stage process combining a Time-Domain Warping (TDW) network for initial synthesis with an Implicit Binaural Corrector (IBC) for spectral refinement.
  • Figure 2: Violin plots of the MOS listening tests for (a) MOS-Q, (b) MOS-S, and (c) MOS-Sim. Statistical significance was determined by pairwise Wilcoxon Signed-Rank Tests comparing our model (Lite-INN) to each baseline (* indicates $p < 0.05$, ** indicates $p < 0.001$).
  • Figure 3: Plot of $\Delta\log A$ and $\Delta\phi$ averaged over the covering frequencies for the channel with the most dominant intensity.