A Machine Learning Approach for Denoising and Upsampling HRTFs

Xuyi Hu, Jian Li, Lorenzo Picinali, Aidan O. T. Hogg

TL;DR

This work tackles the challenge of obtaining personalized HRTFs when measurements are noisy and time-consuming by proposing a denoise-then-upsample framework, HRTF-DUNet. The system denoises spherical harmonic (SH) domain coefficients with a Denoisy U-Net and then upsamples them with an AE-GAN, enabling reconstruction from as few as three measurements. On the SONICOM dataset, it reports a cosine similarity loss of about $0.0070$ for denoising and an LSD of about $5.41$ dB under high sparsity, indicating practical gains for real-world immersive audio. By integrating denoising with learned upsampling, this approach can substantially reduce measurement burden while preserving critical binaural cues for spatial perception.

Abstract

The demand for realistic virtual immersive audio continues to grow, with Head-Related Transfer Functions (HRTFs) playing a key role. HRTFs capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception. It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment. Although machine learning has been shown to reduce the required measurement points and, thus, the measurement time, a controlled environment is still necessary. This paper proposes a method to address this constraint by presenting a novel technique that can upsample sparse, noisy HRTF measurements. The proposed approach combines an HRTF Denoisy U-Net for denoising and an Autoencoding Generative Adversarial Network (AE-GAN) for upsampling from three measurement points. The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating the method's effectiveness in HRTF upsampling.
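
The abstract reports two metrics without defining them, so the following is only a minimal sketch of how they are commonly computed, not the authors' exact formulation. It assumes LSD is the root-mean-square dB difference between magnitude spectra over frequency bins, averaged over measurement positions, and that the cosine similarity loss is one minus the cosine similarity of two flattened real-valued vectors. The function names and the `eps` guard are hypothetical.

```python
import numpy as np


def log_spectral_distortion(h_ref, h_est, eps=1e-12):
    """LSD in dB (assumed form): RMS over the last axis (frequency bins)
    of the dB magnitude ratio, then averaged over all leading axes."""
    mag_ref = np.abs(np.asarray(h_ref)) + eps  # eps avoids log10(0)
    mag_est = np.abs(np.asarray(h_est)) + eps
    diff_db = 20.0 * np.log10(mag_ref / mag_est)
    per_position = np.sqrt(np.mean(diff_db ** 2, axis=-1))
    return float(np.mean(per_position))


def cosine_similarity_loss(a, b, eps=1e-12):
    """Assumed form: 1 - cos(a, b) for flattened real-valued vectors.
    0 means the vectors point the same way; 1 means they are orthogonal."""
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - cos)
```

Under these assumptions, an estimate whose magnitude is twice the reference at every bin gives an LSD of 20·log10(2), roughly 6.02 dB, which gives a sense of scale for the reported 5.41 dB.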

Paper Structure

This paper contains 19 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: HRTF-DUNet Flowchart. The left panel illustrates the data simulation pipeline, where noisy HRTF data is generated, segmented, and transformed using spherical harmonic analysis, resulting in low-resolution noisy coefficients stored in a dataset. The Denoisy U-Net then reconstructs clean SH coefficients from these inputs. The right panel presents the overall model framework. Red arrows indicate the AE-GAN training process, including the feedback loop for parameter updates, while black arrows represent the feedforward process through the model.
  • Figure 2: Scheme of the proposed HRTF Denoisy U-Net.
  • Figure 3: Two illustrative examples (top and bottom) showcasing the HRTF Denoisy U-Net's performance on two different subjects at the same measurement location, with additive white Gaussian noise applied at an SNR of 5 dB.
  • Figure 4: Log-spectral distortion (LSD) error comparison.
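
The noise condition named in the Figure 3 caption (additive white Gaussian noise at an SNR of 5 dB) can be simulated roughly as follows. This is a sketch under assumptions: it treats the input as a real-valued signal array and scales the noise from the signal's mean power. The source does not say at which stage or in which domain the authors add the noise, and `add_awgn` is a hypothetical helper name.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise at the requested SNR (dB).

    Assumes a real-valued signal; the noise power is derived from the
    mean power of the whole array."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Usage: corrupt a test tone at 5 dB SNR with a fixed seed for repeatability.
clean = np.sin(2 * np.pi * 0.01 * np.arange(100_000))
noisy = add_awgn(clean, snr_db=5.0, rng=np.random.default_rng(0))
```

Passing a seeded generator keeps the simulated measurements reproducible across runs.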