Exploring Frequency-Domain Feature Modeling for HRTF Magnitude Upsampling
Xingyu Chen, Hanwen Bi, Fei Ma, Sipei Zhao, Eva Cheng, Ian S. Burnett
TL;DR
This work addresses the reconstruction of dense HRTFs from sparse measurements for personalized spatial audio, emphasizing explicit frequency-domain modeling of the log-magnitude spectrum $H_{\log}$. It introduces the FD-Conformer, a two-module sparse-to-dense network that combines a spatial mapping with a frequency-domain Conformer. The latter projects a binaural spectral representation through Conformer blocks to capture both local and long-range spectral dependencies. The model is trained with a combined log-spectral distortion (LSD) and spectral gradient loss. On the SONICOM and HUTUBS datasets, the FD-Conformer achieves state-of-the-art interaural level difference (ILD) and LSD results, especially under extreme sparsity (e.g., 3–5 measurements), which demonstrates the importance of frequency-aware design for robust HRTF magnitude upsampling. The approach reduces the measurement burden of accurate personalized spatial audio and suggests future work on deeper integration of spectral and spatial modeling.
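As a rough illustration of a combined LSD and spectral gradient loss, the NumPy sketch below may help. It is not the authors' implementation: the array layout (frequency on the last axis, values in dB), the first-order finite-difference form of the gradient term, the L1 penalty on it, and the `weight` parameter are all assumptions made for illustration.

```python
import numpy as np

def lsd(h_log_true, h_log_pred):
    """Log-spectral distortion in dB.

    Inputs are log-magnitude spectra in dB with frequency on the last axis
    (assumed layout). The RMS error is taken over frequency, then averaged
    over all remaining axes (e.g. subjects, directions, ears).
    """
    diff = h_log_true - h_log_pred
    per_spectrum = np.sqrt(np.mean(diff ** 2, axis=-1))
    return float(np.mean(per_spectrum))

def spectral_gradient_loss(h_log_true, h_log_pred):
    """Mean absolute error between first-order differences along frequency.

    Assumed form: penalises mismatched spectral slopes (peaks and notches)
    rather than absolute level.
    """
    grad_true = np.diff(h_log_true, axis=-1)
    grad_pred = np.diff(h_log_pred, axis=-1)
    return float(np.mean(np.abs(grad_true - grad_pred)))

def combined_loss(h_log_true, h_log_pred, weight=1.0):
    """LSD plus a weighted spectral gradient term (weight is hypothetical)."""
    return lsd(h_log_true, h_log_pred) + weight * spectral_gradient_loss(
        h_log_true, h_log_pred
    )
```

Under these assumptions, a constant level offset between prediction and target contributes only to the LSD term, while a ripple that changes sign between neighbouring frequency bins is penalised by both terms.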
Abstract
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject and are limited by the spatial sampling theorem, so their performance degrades significantly under sparse sampling. Recent learning-based methods alleviate this limitation by leveraging cross-subject information. However, most existing neural architectures focus on modeling spatial relationships across directions, while spectral dependencies along the frequency dimension are modeled only implicitly or treated independently. As a result, the strong local continuity and long-range structure that HRTF magnitude responses exhibit in the frequency domain are not fully exploited. This work investigates frequency-domain feature modeling by examining how different architectural choices, ranging from per-frequency multilayer perceptrons to convolutional, dilated convolutional, and attention-based models, affect performance under varying sparsity levels. The comparison shows that explicit spectral modeling consistently improves reconstruction accuracy, particularly under severe sparsity. Motivated by this observation, a frequency-domain Conformer-based architecture is adopted to jointly capture local spectral continuity and long-range frequency correlations. Experimental results on the SONICOM and HUTUBS datasets demonstrate that the proposed method achieves state-of-the-art performance in terms of interaural level difference (ILD) and log-spectral distortion (LSD).
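To make the interaural level difference metric concrete, the following NumPy sketch computes an ILD error between reference and reconstructed log-magnitude HRTFs. The definition used here (left-minus-right level in dB, averaged over frequency, with the ear axis second to last) is one common convention and an assumption of this sketch; the exact definition used in the evaluation may differ.

```python
import numpy as np

def ild_db(h_log):
    """Interaural level difference per direction, in dB.

    h_log has shape (..., 2, F): log-magnitudes in dB, with ear index
    0 = left and 1 = right on the second-to-last axis (assumed layout).
    """
    left = h_log[..., 0, :]
    right = h_log[..., 1, :]
    return np.mean(left - right, axis=-1)

def ild_error(h_log_true, h_log_pred):
    """Mean absolute ILD difference between reference and reconstruction."""
    return float(np.mean(np.abs(ild_db(h_log_true) - ild_db(h_log_pred))))
```

With this definition, a reconstruction that shifts both ears by the same number of dB has zero ILD error even though its LSD is nonzero, which is why the two metrics are reported together.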
