HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection
Aidan O. T. Hogg, Mads Jenkins, He Liu, Isaac Squires, Samuel J. Cooper, Lorenzo Picinali
TL;DR
This work addresses the challenge of obtaining individualised head-related transfer functions (HRTFs) for realistic VR/AR audio by proposing a super-resolution generative adversarial network (SRGAN) framework that upsamples sparse HRTF measurements. A gnomonic equiangular (cubed-sphere) projection converts the spherical HRTF data into a CNN-friendly 2D representation, so that a convolutional network can upsample across the whole sphere. The generator is trained with a content loss that combines log-spectral distortion (LSD) and interaural level difference (ILD) terms alongside an adversarial loss, while post-processing reconstructs phase and interaural time difference (ITD) using a minimum-phase approach and a simple ITD model. Empirical results show that the SRGAN outperforms barycentric interpolation and spherical harmonics when the input is very sparse (fewer than 20 positions), and perceptual localisation metrics corroborate the improvement, which matters for low-cost HRTF acquisition. The work contributes to open-source tools for fast, personalised spatial audio by producing high-quality HRTFs from limited measurements, and it points to perceptual losses and phase information as directions for future iterations.
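To make the projection step concrete, the following is a minimal sketch of an equiangular cubed-sphere mapping, not the authors' implementation. It assumes azimuth and elevation in radians, azimuth increasing towards the listener's left, and a simplified per-face axis convention; the function name `sphere_to_cube` and the face labels are hypothetical. Each direction is assigned to the cube face it points at, projected gnomonically onto that face, and then expressed as two angles in [-π/4, π/4], which gives a near-uniform grid per face that a 2D CNN can process.

```python
import math

def sphere_to_cube(azimuth, elevation):
    """Map a direction (radians) to (face, xi, eta) on an equiangular cubed sphere.

    Assumed convention: +x is front, +y is left, +z is up.
    xi and eta are equiangular coordinates in [-pi/4, pi/4].
    """
    # Unit vector for the direction.
    x = math.cos(elevation) * math.cos(azimuth)
    y = math.cos(elevation) * math.sin(azimuth)
    z = math.sin(elevation)
    ax, ay, az = abs(x), abs(y), abs(z)

    # Pick the face from the dominant axis; w is the distance along that axis.
    if az >= ax and az >= ay:
        face = "top" if z > 0 else "bottom"
        u, v, w = x, y, az
    elif ax >= ay:
        face = "front" if x > 0 else "back"
        u, v, w = y, z, ax
    else:
        face = "left" if y > 0 else "right"
        u, v, w = x, z, ay

    # u / w and v / w are the gnomonic (tangent-plane) coordinates in [-1, 1];
    # taking the arctangent makes the grid equiangular.
    return face, math.atan(u / w), math.atan(v / w)
```

For example, `sphere_to_cube(0.0, 0.0)` lands at the centre of the front face. A full pipeline would also need a consistent orientation for each face and an interpolation of the measured positions onto the face grids, both of which are omitted here.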
Abstract
An individualised head-related transfer function (HRTF) is important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and make the measurement more efficient, HRTF upsampling has been used in the past, whereby a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (fewer than 20 measured positions).
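As an illustration of the LSD metric mentioned above, here is a minimal sketch using the commonly used definition: the root-mean-square, over frequency bins, of the dB difference between a reference and an upsampled magnitude spectrum, averaged over source positions. The exact frequency range and averaging used in the paper are not stated in this section, so those choices, and the function names `log_spectral_distortion` and `mean_lsd`, are assumptions.

```python
import math

def log_spectral_distortion(h_true, h_est):
    """LSD in dB between two magnitude spectra for a single source position.

    h_true, h_est: equal-length sequences of strictly positive magnitudes.
    """
    if len(h_true) != len(h_est) or len(h_true) == 0:
        raise ValueError("spectra must be non-empty and of equal length")
    # Squared dB difference per frequency bin.
    sq_db = [(20.0 * math.log10(a / b)) ** 2 for a, b in zip(h_true, h_est)]
    return math.sqrt(sum(sq_db) / len(sq_db))

def mean_lsd(true_set, est_set):
    """Average the per-position LSD over all source positions."""
    if len(true_set) != len(est_set) or len(true_set) == 0:
        raise ValueError("position sets must be non-empty and of equal length")
    values = [log_spectral_distortion(t, e) for t, e in zip(true_set, est_set)]
    return sum(values) / len(values)
```

An LSD of 0 dB means the two magnitude spectra are identical, and lower values indicate a closer spectral match to the measured HRTF.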
