HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection

Aidan O. T. Hogg, Mads Jenkins, He Liu, Isaac Squires, Samuel J. Cooper, Lorenzo Picinali

TL;DR

This work addresses the challenge of obtaining individualised HRTFs for realistic VR/AR audio by proposing a super-resolution generative adversarial network (SRGAN) framework that upsamples sparse HRTF measurements. A gnomonic equiangular (cubed-sphere) projection converts the spherical HRIR data into a CNN-friendly 2D representation, so that convolutional upsampling can be applied across the whole sphere. The generator is trained with a content loss combining log-spectral distortion (LSD) and interaural level difference (ILD) terms alongside an adversarial loss, while post-processing reconstructs phase and interaural time difference (ITD) using a minimum-phase approach and a simple ITD model. Empirical results show the SRGAN outperforms barycentric interpolation and spherical harmonic upsampling when the input is very sparse (≤20 positions), and perceptual localisation models corroborate the improvement, which matters for low-cost HRTF acquisition. The work contributes open-source tools for fast, personalised spatial audio by producing high-quality HRTFs from limited measurements, and points to perceptual losses and phase information as directions for future iterations.
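To make the projection step more concrete, the following is a minimal sketch of a generic equiangular cubed-sphere mapping, not the authors' implementation. The function name `sph_to_cube`, the face labels, and the axis convention (x to the front, y to the left, z up) are assumptions made for illustration, and per-face orientation conventions are ignored. A direction is assigned to the cube face of its dominant axis, projected gnomonically onto that face, and the arctangent of each face coordinate then gives angles that are equally spaced across the face.

```python
import math


def sph_to_cube(azimuth_deg, elevation_deg):
    """Map a direction to (face, alpha, beta) on an equiangular cubed sphere.

    Face labels and axis conventions are illustrative assumptions.
    alpha and beta are returned in radians, each within [-pi/4, pi/4].
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)

    # Unit vector for the direction (assumed: x front, y left, z up).
    x = math.cos(el) * math.cos(az)
    y = math.cos(el) * math.sin(az)
    z = math.sin(el)
    abs_x, abs_y, abs_z = abs(x), abs(y), abs(z)

    # Pick the face of the dominant axis, then divide by that axis
    # (gnomonic projection) so u and v lie in [-1, 1].
    if abs_z >= abs_x and abs_z >= abs_y:
        face = "top" if z > 0 else "bottom"
        u, v = y / abs_z, x / abs_z
    elif abs_x >= abs_y:
        face = "front" if x > 0 else "back"
        u, v = y / abs_x, z / abs_x
    else:
        face = "left" if y > 0 else "right"
        u, v = x / abs_y, z / abs_y

    # Equiangular step: equal steps in alpha/beta are equal steps in angle.
    return face, math.atan(u), math.atan(v)
```

For example, `sph_to_cube(30.0, 0.0)` lands on the face labelled "front" here, with `alpha` equal to 30 degrees expressed in radians. Sampling `alpha` and `beta` on a regular grid per face yields the kind of 2D panels a convolutional network can consume.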

Abstract

An individualised head-related transfer function (HRTF) is very important for creating realistic virtual reality (VR) and augmented reality (AR) environments. However, acoustically measuring high-quality HRTFs requires expensive equipment and an acoustic lab setting. To overcome these limitations and to make this measurement more efficient, HRTF upsampling has been exploited in the past, where a high-resolution HRTF is created from a low-resolution one. This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling. We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN). This new approach is benchmarked against three baselines: barycentric upsampling, spherical harmonic (SH) upsampling and an HRTF selection approach. Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance using perceptual models when the input HRTF is sparse (fewer than 20 measured positions).
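Since LSD is the headline metric, a short sketch of how it is commonly computed may help. The code assumes the widely used definition, the root-mean-square over frequency bins of the dB ratio between reference and estimated magnitude spectra, averaged over source positions; the paper's exact normalisation may differ. The function names and the list-of-lists data layout are hypothetical.

```python
import math


def log_spectral_distortion(h_true, h_est):
    """LSD in dB between two magnitude spectra of equal length.

    Assumes strictly positive magnitudes (linear scale, not dB).
    """
    if len(h_true) != len(h_est) or not h_true:
        raise ValueError("spectra must be non-empty and of equal length")
    # Squared dB difference per frequency bin.
    sq_db = [(20.0 * math.log10(a / b)) ** 2 for a, b in zip(h_true, h_est)]
    return math.sqrt(sum(sq_db) / len(sq_db))


def mean_lsd(hrtf_true, hrtf_est):
    """Average LSD over positions; each HRTF is a list of magnitude spectra."""
    if len(hrtf_true) != len(hrtf_est) or not hrtf_true:
        raise ValueError("HRTFs must contain the same, non-zero number of positions")
    per_position = [
        log_spectral_distortion(t, e) for t, e in zip(hrtf_true, hrtf_est)
    ]
    return sum(per_position) / len(per_position)
```

An LSD of 0 dB means the two magnitude spectra are identical, and a constant 20 dB level offset across all bins gives an LSD of 20 dB, so lower values indicate a closer spectral match.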

Paper Structure (30 sections, 27 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: The four steps that project the IR locations for a given HRTF in the ARI dataset onto a flattened uniform cube.
  • Figure 2: Each face of the gnomonic equiangular projection (in green) is padded with data from the adjacent faces (in red). This is displayed both for the 3D cube (a) and for the flattened 2D surface (b). At the corners the value is ambiguous; therefore, values are taken from the top panel (Weyn2020).
  • Figure 3: The architecture of the discriminator and generator networks, where each convolutional layer contains $k$ kernels, $n$ feature layers, and $s$ stride. Acronyms: LReLU, PReLU.
  • Figure 4: The source positions for each downsampling factor.
  • Figure 5: Illustrative example of overall loss curves for the $20 \rightarrow 1280$ network.
  • ...and 4 more figures