Interaural time difference loss for binaural target sound extraction
Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki
TL;DR
The paper addresses how to preserve binaural spatial cues while extracting a target sound from a binaural mixture. It proposes a multi-task training framework that combines conventional signal-level losses with explicit spatial losses on the interaural level difference (ILD), interaural phase difference (IPD), and interaural time difference (ITD). The key contribution is a differentiable ITD loss based on generalized cross-correlation with phase transform (GCC-PHAT), which targets ITD preservation directly. In experiments on mixtures of 3–4 sounds drawn from 20 classes, the ITD loss yields the largest overall gains in spatial cue preservation without degrading signal-level quality or increasing failure rates. The results suggest the approach may apply broadly to binaural speech and audio processing, including hearable applications.
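As a rough illustration of how an ITD loss can be built on GCC-PHAT, the sketch below estimates the ITD of a binaural pair and compares it with the ITD of a reference pair. It is not the paper's formulation, which this summary does not spell out. The soft-argmax over lags (a common way to obtain a differentiable peak estimate), the sharpness parameter `beta`, the lag window `max_lag`, and the function names `gcc_phat`, `soft_itd`, and `itd_loss` are all assumptions made for illustration. NumPy is used for readability; a training setup would use the equivalent operations in an autodiff framework.

```python
import numpy as np

def gcc_phat(left, right, eps=1e-8):
    """GCC-PHAT cross-correlation, returned with lag 0 at the center index."""
    n = len(left) + len(right)            # zero-pad to limit circular wrap-around
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross = cross / (np.abs(cross) + eps)  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    return np.fft.fftshift(cc)

def soft_itd(left, right, fs, max_lag, beta=50.0):
    """Soft-argmax ITD estimate in seconds (assumed differentiable surrogate)."""
    cc = gcc_phat(left, right)
    mid = len(cc) // 2
    lags = np.arange(-max_lag, max_lag + 1)
    window = cc[mid - max_lag: mid + max_lag + 1]
    weights = np.exp(beta * (window - window.max()))  # numerically stable softmax
    weights = weights / weights.sum()
    return float(np.sum(weights * lags) / fs)

def itd_loss(est_l, est_r, ref_l, ref_r, fs, max_lag, beta=50.0):
    """Absolute ITD error between the extracted and reference binaural signals."""
    itd_est = soft_itd(est_l, est_r, fs, max_lag, beta)
    itd_ref = soft_itd(ref_l, ref_r, fs, max_lag, beta)
    return abs(itd_est - itd_ref)
```

With this sign convention, a positive ITD means the left channel lags the right. Restricting the lag window to physically plausible values (on the order of a millisecond for a human head) keeps spurious far-away correlation peaks out of the estimate.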
Abstract
Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. For many applications, both the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network trained to output only the desired sound, given as inputs a binaural mixture and an embedding characterizing the desired sound class. Conventional TSE systems are trained using signal-level losses, which measure the difference between the extracted and reference signals for the left and right channels. In this paper, we propose adding explicit spatial losses to better preserve the spatial cues of the target sound. In particular, we explore losses that aim to preserve the interaural level difference (ILD), interaural phase difference (IPD), and interaural time difference (ITD). We show experimentally that adding such spatial losses, particularly our newly proposed ITD loss, helps better preserve spatial cues while maintaining the signal-level metrics.
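To make the ILD and IPD terms concrete, the following minimal sketch computes both cues from short-time spectra of the left and right channels and compares extracted against reference signals. The abstract does not give the exact loss definitions, so the choices here are assumptions: a Hann-windowed STFT with hypothetical frame and hop sizes, a mean absolute error on the ILD in dB, and a `1 - cos` distance on the IPD to avoid phase-wrapping issues. The function names are illustrative only.

```python
import numpy as np

def stft(x, frame=512, hop=256):
    """Simple Hann-windowed STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop: i * hop + frame] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def ild_db(L, R, eps=1e-8):
    """Interaural level difference in dB per time-frequency bin."""
    return 10.0 * np.log10((np.abs(L) ** 2 + eps) / (np.abs(R) ** 2 + eps))

def ipd(L, R):
    """Interaural phase difference in radians per time-frequency bin."""
    return np.angle(L * np.conj(R))

def spatial_losses(est_l, est_r, ref_l, ref_r):
    """Return (ILD loss, IPD loss) between extracted and reference signals."""
    EL, ER, RL, RR = (stft(x) for x in (est_l, est_r, ref_l, ref_r))
    ild_loss = np.mean(np.abs(ild_db(EL, ER) - ild_db(RL, RR)))
    # 1 - cos is insensitive to 2*pi wrapping of the phase difference
    ipd_loss = np.mean(1.0 - np.cos(ipd(EL, ER) - ipd(RL, RR)))
    return float(ild_loss), float(ipd_loss)
```

In a multi-task setup such as the one described above, terms like these would be added to the signal-level loss with tunable weights; the weighting scheme is not specified in the abstract.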
