Predicting Global HRTFs From Scanned Head Geometry Using Deep Learning and Compact Representations

Yuxiang Wang, You Zhang, Zhiyao Duan, Mark Bocko

TL;DR

This work tackles personalized HRTF prediction for spatial audio by learning a mapping from scanned head geometry to HRTFs for all directions. It introduces three compact representations: SH coefficients of the HRTF magnitudes at multiple frequencies, SH coefficients of the HRTF onsets, and SCH coefficients of the ear geometry. CNNs take the geometry-side representations as input and predict the global SH coefficients of the HRTFs. The method achieves LSDs around 3.9–4.1 dB and onset errors in the tens of microseconds, with localization performance surpassing a boundary element method baseline in frontal regions. Overall, the approach enables geometry-driven HRTF personalization from 3D scans, with potential for real-time AR/VR audio rendering and perceptual benefits.

Abstract

In the growing field of virtual auditory display, personalized head-related transfer functions (HRTFs) play a vital role in establishing an accurate sound image for mixed and augmented reality applications. In this work, we propose an HRTF personalization method employing convolutional neural networks (CNNs) to predict a subject's HRTFs for all directions from their scanned head geometry. To ease the training of the CNN models, we propose novel pre-processing methods for both the head scans and HRTF data to achieve compact representations. For the head scan, we use truncated spherical cap harmonic (SCH) coefficients to represent the pinna area, which is important in the acoustic scattering process. For the HRTF data, we use truncated spherical harmonic (SH) coefficients to represent the HRTF magnitudes and onsets. One CNN model is trained to predict the SH coefficients of the HRTF magnitudes from the SCH coefficients of the scanned ear geometry and other anthropometric measurements of the head. The other CNN model is trained to predict the SH coefficients of the HRTF onsets from only the anthropometric measurements of the ear, head, and torso. Combining the magnitude and onset predictions, our method is able to predict the complete and global HRTF data. A leave-one-out validation with the log-spectral distortion (LSD) metric is used for objective evaluation. The results show a decent LSD level in both the spatial and temporal dimensions compared to the ground-truth HRTFs and a lower LSD than the boundary element method (BEM) simulation of HRTFs that the database provides. The localization simulation results with an auditory model are also consistent with the objective evaluation metrics, showing that the localization responses with our predicted HRTFs are significantly better than with the BEM-calculated ones.
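
The LSD metric named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition, the root-mean-square of the dB difference between a reference and an estimated magnitude spectrum over frequency bins; the exact formula, frequency range, and averaging over directions used in the paper are not given in this excerpt. The function name `lsd_db` and the toy spectra are hypothetical.

```python
import math


def lsd_db(h_ref, h_est):
    """Log-spectral distortion in dB between two magnitude spectra.

    Assumed definition (not confirmed by this excerpt):
    sqrt(mean_k (20 * log10(|H_ref[k]| / |H_est[k]|))^2).
    Both inputs are sequences of positive linear magnitudes.
    """
    if len(h_ref) != len(h_est) or len(h_ref) == 0:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_db = [(20.0 * math.log10(r / e)) ** 2 for r, e in zip(h_ref, h_est)]
    return math.sqrt(sum(squared_db) / len(squared_db))


# Toy magnitude spectrum for one direction (illustrative values only).
reference = [1.0, 0.5, 0.25, 0.125]
doubled = [2.0 * v for v in reference]

print(lsd_db(reference, reference))  # identical spectra give 0.0
print(lsd_db(reference, doubled))    # uniform factor of 2, about 6.02 dB
```

Under this definition, an LSD near 4 dB corresponds to an average magnitude mismatch somewhat smaller than a factor of 2 (about 6.02 dB) across the evaluated bins.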

Paper Structure

This paper contains 15 sections, 7 equations, and 13 figures.

Figures (13)

  • Figure 1: Example of the HRTF magnitude pattern in dB scale and its SHT processing. (a) is the HRTF pattern at a frequency of 15.9 kHz plotted in a spherical coordinate system in dB scale. The magnitude value is assigned as both the color map and the distance from each corresponding source location to the origin. (b) is the reconstructed pattern from SHT at $L = 7$, and (c) shows the $(L+1)^2 = 64$ SH coefficients produced. We set the same color map in (a) and (b) for better comparison.
  • Figure 2: Example of the HRTF onset pattern and its SHT processing. (a) is the onset pattern plotted in a spherical coordinate system. The onset value is assigned as both the color map and the distance from each corresponding source location to the origin. (b) is the reconstructed pattern from SHT at $L = 5$, and (c) shows the $(L+1)^2 = 36$ SH coefficients produced. We set the same color map in (a) and (b) for better comparison.
  • Figure 3: Example of the head mesh mapping process. (a) is a subject's original head scan, and (b) is the corresponding conformal mapping onto the unit sphere.
  • Figure 4: SCH bases up to $k = 4$ at a half-cone angle of 25 degrees. Numbers in parentheses are the degree index $k$ and order $m$, where $-k \leqslant m \leqslant k$.
  • Figure 5: Ear SCHA process flow. (a) is the cap area cropped from the head spherical mapping, with the ear regions within. The cap is centered at the ear canal entrance, with a half-cone angle of 30 degrees. (b) is the corresponding ear mesh in the original mesh space. (c) is the remeshed ear with uniform cap sampling (9062 vertices) with a half-cone angle of 25 degrees. The SCHA is then applied to these 9062 vertex samples. (d) is the reconstructed mesh after the SCHA at a truncation order of 20. Meshes in (b) and (c) are set to the same scale for better comparison.
  • ...and 8 more figures
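
The coefficient counts quoted in the captions of Figures 1 and 2 follow from the index range of the harmonics: each degree $l$ contributes $2l + 1$ orders $m$ with $-l \leqslant m \leqslant l$, so truncating at degree $L$ leaves $(L+1)^2$ coefficients. The sketch below enumerates the index pairs to check those counts. The function names are hypothetical, and the flat ordering $l^2 + l + m$ is an assumed convention for illustration, not necessarily the layout used in the paper.

```python
def sh_index_pairs(max_degree):
    """List every (degree, order) pair for a truncation at max_degree."""
    return [(l, m) for l in range(max_degree + 1) for m in range(-l, l + 1)]


def flat_index(l, m):
    """Map (degree, order) to a position in a flat coefficient vector.

    Assumed ordering: degree by degree, orders from -l to +l.
    """
    if abs(m) > l:
        raise ValueError("order must satisfy -l <= m <= l")
    return l * l + l + m


magnitude_pairs = sh_index_pairs(7)  # Figure 1 truncates magnitudes at L = 7
onset_pairs = sh_index_pairs(5)      # Figure 2 truncates onsets at L = 5

print(len(magnitude_pairs))  # 64
print(len(onset_pairs))      # 36
print(flat_index(7, 7))      # 63, the last slot of the 64-element vector
```

The lower truncation degree for the onsets reflects that the onset pattern is represented with fewer coefficients (36) than a magnitude pattern at a single frequency (64).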