
BiFormer3D: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer

Shaoheng Xu, Chunyi Sun, Jihui Zhang, Amy Bastine, Prasanga N. Samarasinghe, Thushara D. Abhayapala, Hongdong Li

Abstract

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose BiFormer3D, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase pre-processing is unnecessary.
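The abstract reports NMSE, cosine distance, and ITD errors as evaluation metrics. As a rough illustration of how such metrics are commonly computed on time-domain HRIRs, here is a minimal numpy sketch; the exact definitions used in the paper (e.g. its ITD estimator) may differ, and the cross-correlation ITD below is only one standard choice:

```python
import numpy as np

def nmse_db(est, ref):
    """NMSE in dB between estimated and reference HRIRs, shape (..., T)."""
    err = np.sum((est - ref) ** 2, axis=-1)
    ref_energy = np.sum(ref ** 2, axis=-1)
    return 10.0 * np.log10(err / ref_energy)

def cosine_distance(est, ref):
    """1 minus cosine similarity between two impulse responses."""
    num = np.sum(est * ref, axis=-1)
    den = np.linalg.norm(est, axis=-1) * np.linalg.norm(ref, axis=-1)
    return 1.0 - num / den

def itd_us(hrir_l, hrir_r, fs=48000):
    """ITD estimate in microseconds via the cross-correlation peak lag
    (one common estimator; not necessarily the paper's exact method)."""
    xc = np.correlate(hrir_l, hrir_r, mode="full")
    lag = np.argmax(np.abs(xc)) - (len(hrir_r) - 1)
    return 1e6 * lag / fs
```

For example, an estimate equal to half the reference has NMSE of 10 log10(0.25) ≈ -6.02 dB and zero cosine distance, since the two signals are collinear.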

Paper Structure

This paper contains 11 sections, 13 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of HRIR spatial up-sampling: estimating HRIRs at unmeasured target directions (white) from a sparse set of measured directions (red).
  • Figure 2: Overview of the proposed BiFormer3D pipeline. Known HRIRs are projected into a latent space using an MLP-based signal encoder, while geometric embeddings derived from direction coordinates are independently projected and added to the signal features. The resulting tokens are processed jointly by a Transformer encoder to capture global spatial dependencies across measured and target directions. A shared MLP decoder maps contextual features to full-length binaural HRIRs for all directions, followed by masked fusion to preserve measured responses and a lightweight Conv1D for temporal refinement.
  • Figure 3: Left-ear HRIR reconstruction for subject P0187 at $(\phi,\theta,r)=(90^\circ,20^\circ,1.5~\mathrm{m})$ under $M=19$: estimated HRIR (dashed) versus ground-truth HRIR (solid). Angular distance to the nearest measured direction: $35.2^\circ$.
  • Figure 4: Stacked binaural HRIRs for subject P0187 under $M=19$ over 793 directions (each column: one direction; left/right ears concatenated along time). Ground truth (left) and estimation (right).
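The pipeline in Figure 2 adds geometric embeddings derived from direction coordinates to the signal features. As a minimal sketch of one plausible form of such sinusoidal spatial features (the paper's exact feature design is not specified here, so the frequency schedule and coordinate scaling below are assumptions, in the spirit of NeRF-style positional encodings):

```python
import numpy as np

def sinusoidal_direction_features(az_deg, el_deg, r_m, n_freqs=4):
    """Encode a direction (azimuth/elevation in degrees, distance in
    metres) as concatenated sin/cos features at dyadic frequencies.
    Hypothetical design: frequencies 2^0 .. 2^(n_freqs-1)."""
    coords = np.array([np.deg2rad(az_deg), np.deg2rad(el_deg), r_m])
    feats = []
    for k in range(n_freqs):
        feats.append(np.sin(2.0 ** k * coords))
        feats.append(np.cos(2.0 ** k * coords))
    return np.concatenate(feats)  # shape: (2 * n_freqs * 3,)
```

Such features give the Transformer a smooth, grid-free representation of each measured or target direction, so queries need not lie on a fixed sampling grid.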