Decomposing the Influence of Physical Acoustic Modeling on Neural Personal Sound Zone Rendering: An Ablation Study

Hao Jiang, Edgar Choueiri

TL;DR

A controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network to guide prioritization of measurements and models when constructing training ATFs under limited budgets.

Abstract

Deep learning-based Personal Sound Zones (PSZs) rely on simulated acoustic transfer functions (ATFs) for training, yet idealized point-source models exhibit large sim-to-real gaps. While physically informed components improve generalization, their individual contributions remain unclear. This paper presents a controlled ablation study on a head-pose-conditioned binaural PSZ renderer using the Binaural Spatial Audio Neural Network (BSANN). We progressively enrich simulated ATFs with three components: (i) anechoically measured frequency responses of the particular loudspeakers (FR), (ii) analytic circular-piston directivity (DIR), and (iii) rigid-sphere head-related transfer functions (RS-HRTF). Four configurations are evaluated via in-situ measurements with two dummy heads. Performance metrics include inter-zone isolation (IZI), inter-program interference (IPI), and crosstalk cancellation (XTC) over 100-20,000 Hz. Results show that FR provides spectral calibration, yielding modest XTC improvements and reduced inter-listener IPI imbalance. DIR delivers the most consistent sound-zone separation gains (10.05 dB average IZI/IPI). RS-HRTF dominates binaural separation, boosting XTC by +2.38/+2.89 dB (average 4.51 to 7.91 dB), primarily above 2 kHz, while introducing mild listener-dependent IZI/IPI shifts. These findings guide prioritization of measurements and models when constructing training ATFs under limited budgets.
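To make the DIR component concrete, the sketch below evaluates the textbook far-field directivity of a rigid circular piston in an infinite baffle, D(theta) = 2 J1(ka sin theta) / (ka sin theta). This is the standard analytic form and may differ from the paper's exact formulation, which this summary does not show. The function names, the piston radius, and the speed of sound (343 m/s) are illustrative assumptions. J1 is computed from Bessel's integral so that the example needs only the standard library.

```python
import math


def bessel_j1(x, n=256):
    """First-order Bessel function J1(x) from Bessel's integral:
    J1(x) = (1/pi) * integral_0^pi cos(tau - x*sin(tau)) d tau,
    evaluated with the trapezoidal rule on n intervals."""
    h = math.pi / n
    # End-point terms of the trapezoidal rule (tau = 0 and tau = pi).
    total = 0.5 * (math.cos(0.0) + math.cos(math.pi - x * math.sin(math.pi)))
    for i in range(1, n):
        tau = i * h
        total += math.cos(tau - x * math.sin(tau))
    return total * h / math.pi


def piston_directivity(freq_hz, theta_rad, radius_m, c=343.0):
    """Far-field directivity of a baffled rigid circular piston,
    normalized to 1 on axis. radius_m and c are assumed values."""
    k = 2.0 * math.pi * freq_hz / c           # wavenumber in rad/m
    x = k * radius_m * math.sin(theta_rad)    # argument ka*sin(theta)
    if abs(x) < 1e-8:                         # on-axis limit: 2*J1(x)/x -> 1
        return 1.0
    return 2.0 * bessel_j1(x) / x


if __name__ == "__main__":
    # Hypothetical 12.5 mm radius driver, 45 degrees off axis.
    for f in (100.0, 2000.0, 20000.0):
        d = piston_directivity(f, math.radians(45.0), 0.0125)
        print(f"{f:8.0f} Hz  D = {d:+.4f}")
```

At low frequencies (small ka) the factor stays close to 1, so a point-source model and the piston model nearly coincide; the two diverge as ka grows, which is where a directivity term would be expected to matter most.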

Paper Structure (14 sections, 14 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Geometric configuration of the experimental testbed. The playback system features a 24-element loudspeaker array mounted on a rigid baffle, comprising two rows: 8 woofers ($100$--$2000$ Hz) and 16 tweeters ($2$--$20$ kHz). Two HATS are positioned symmetrically at a distance of 1.0 m from the array with lateral offsets of $\pm 0.5$ m relative to the centerline, oriented perpendicular to the array midpoint.
  • Figure 2: Broadband performance summary for the four cumulative configurations. Top row: log-mean metric values over 100--20,000 Hz for Listener 1 and Listener 2 under C0--C3. Bottom row: incremental changes under the cumulative ablation protocol (FR: C1$-$C0; DIR: C2$-$C1; RS-HRTF: C3$-$C2).
  • Figure 3: Frequency-dependent IZI, IPI, and XTC for the four cumulative training configurations, evaluated at a single static pose. Legends correspond to C0 (baseline point-source simulation), C1 (+FR), C2 (+FR+DIR), and C3 (+FR+DIR+RS-HRTF). Top row: Listener 1. Bottom row: Listener 2.
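The cumulative protocol in the Figure 2 caption attributes each increment to the component added at that step (FR: C1-C0; DIR: C2-C1; RS-HRTF: C3-C2). A minimal sketch of that bookkeeping follows. The function name and the numeric values are placeholders for illustration only, not results from the paper.

```python
def incremental_changes(metric_db):
    """Per-component increments under a cumulative ablation.

    metric_db maps configuration labels "C0".."C3" to a broadband
    metric value in dB. Returns the change attributed to each
    added component: FR = C1-C0, DIR = C2-C1, RS-HRTF = C3-C2.
    """
    order = ["C0", "C1", "C2", "C3"]
    components = ["FR", "DIR", "RS-HRTF"]
    return {
        name: metric_db[after] - metric_db[before]
        for name, before, after in zip(components, order, order[1:])
    }


if __name__ == "__main__":
    # Placeholder values in dB, chosen only to exercise the function.
    example = {"C0": 4.0, "C1": 4.5, "C2": 5.0, "C3": 8.0}
    for name, delta in incremental_changes(example).items():
        print(f"{name:8s} {delta:+.2f} dB")
```

Because the differences telescope, the three increments always sum to C3-C0. Each increment is conditional on the components already present, so a different ordering of the ablation could attribute the gains differently.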