A dataset and model for recognition of audiologically relevant environments for hearing aids: AHEAD-DS and YAMNet+

Henry Zhong, Jörg M. Buchholz, Julian Maclaren, Simon Carlile, Richard Lyon

TL;DR

This work tackles two problems: the lack of public, standardized benchmarks for recognizing audiologically relevant scenes in hearing devices, and the difficulty of deploying such models on edge hardware. It introduces AHEAD-DS, a ready-to-use dataset with 14 audiologically relevant labels derived from HEAR-DS and CHiME-6 Dev, and YAMNet+, a lightweight, edge-friendly sound recognition model trained by transfer learning from the pretrained YAMNet model (itself trained on AudioSet). On the AHEAD-DS test set, YAMNet+ achieves a mean average precision of 0.83 and an accuracy of 0.93, with real-time inference demonstrated on a Google Pixel 3 (approximately 50 ms to load the model, plus approximately 30 ms per second of audio). Together, the dataset and model provide a publicly accessible benchmark and an open-source, deployable baseline to accelerate research on and deployment of scene recognition for hearing devices.

Abstract

Scene recognition of audiologically relevant environments is important for hearing aids; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying these models on resource-constrained edge devices presents another challenge. Our solution is two-fold: we leverage several open source datasets to create AHEAD-DS, a dataset designed for scene recognition of audiologically relevant environments, and introduce YAMNet+, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. YAMNet+ is designed for deployment on edge devices, such as smartphones connected to hearing devices (hearing aids and wireless earphones with hearing aid functionality), and serves as a baseline model for sound-based scene recognition. YAMNet+ achieved a mean average precision of 0.83 and an accuracy of 0.93 on the testing set of AHEAD-DS across fourteen categories of audiologically relevant environments. We found that applying transfer learning from the pretrained YAMNet model was essential. We demonstrated real-time sound-based scene recognition on edge devices by deploying YAMNet+ to an Android smartphone. Even on a Google Pixel 3 (a phone with modest specifications, released in 2018), loading the model takes approximately 50 ms, and processing latency increases approximately linearly by 30 ms per second of audio. Our website and code are available at https://github.com/Australian-Future-Hearing-Initiative.
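
The latency figures in the abstract suggest a simple linear cost model. The sketch below is only a back-of-the-envelope illustration of that arithmetic, not the authors' benchmark code. The constant and function names are hypothetical, and it assumes the 50 ms model load is a one-off cost added to 30 ms of processing per second of audio.

```python
# Approximate figures reported for a Google Pixel 3 in the abstract.
MODEL_LOAD_MS = 50.0        # one-off cost to load the model (assumed to be paid once)
MS_PER_AUDIO_SECOND = 30.0  # approximate linear processing cost per 1 s of audio


def estimated_latency_ms(audio_seconds: float) -> float:
    """Estimate total latency (ms) for a clip, including the model load."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return MODEL_LOAD_MS + MS_PER_AUDIO_SECOND * audio_seconds


def real_time_factor() -> float:
    """Processing time divided by audio duration, excluding the model load.

    A value below 1.0 means audio is processed faster than it arrives.
    """
    return MS_PER_AUDIO_SECOND / 1000.0
```

Under these assumptions, a 10 second clip would take roughly 50 + 30 × 10 = 350 ms, and the real-time factor is about 0.03, which is consistent with the claim of real-time operation.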

Paper Structure

This paper contains 25 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A data flow diagram showing the data processing procedures. Half of the environment sounds were mixed with speech.
  • Figure 2: The architecture of YAMNet+. The model computes the log mel spectrogram from the waveform. This is passed to a MobileNet which assigns confidence values to each label.
  • Figure 3: YAMNet+ precision and recall curves. A mAP of 0.83 was achieved. A threshold was applied to determine the label for each 960 millisecond window and compared against the ground truth. A larger area under the curve indicates better performance.
  • Figure 4: YAMNet+ confusion matrix. An accuracy of 0.93 was achieved. The highest scoring label was compared to the ground truth. Each number in each cell represents the number of 960ms windows assigned to a label. A higher number along the diagonals indicates better results.
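
The captions for Figures 3 and 4 describe two per-window evaluations: comparing the highest-scoring label with the ground truth (confusion matrix and accuracy), and applying a threshold to each label's confidence (precision and recall). The following is a minimal sketch of those two computations in plain Python, not the authors' evaluation code. The function names and the placeholder labels used in the usage note are hypothetical, and each window is assumed to have exactly one ground-truth label.

```python
from collections import Counter
from typing import Mapping, Sequence, Tuple


def top_label(scores: Mapping[str, float]) -> str:
    """Return the label with the highest confidence for one window."""
    return max(scores, key=lambda label: scores[label])


def confusion_counts(truth: Sequence[str],
                     window_scores: Sequence[Mapping[str, float]]) -> Counter:
    """Count windows per (true label, predicted label) pair."""
    counts: Counter = Counter()
    for true_label, scores in zip(truth, window_scores):
        counts[(true_label, top_label(scores))] += 1
    return counts


def accuracy(counts: Counter) -> float:
    """Fraction of windows on the diagonal of the confusion matrix."""
    total = sum(counts.values())
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / total if total else 0.0


def precision_recall(truth: Sequence[str],
                     window_scores: Sequence[Mapping[str, float]],
                     label: str,
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall for one label at one confidence threshold."""
    tp = fp = fn = 0
    for true_label, scores in zip(truth, window_scores):
        predicted = scores.get(label, 0.0) >= threshold
        if predicted and true_label == label:
            tp += 1
        elif predicted:
            fp += 1
        elif true_label == label:
            fn += 1
    # Convention (an assumption): precision is 1.0 when nothing is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, with four windows whose true labels are `["a", "a", "b", "b"]` and whose top-scoring labels are `a, b, b, b`, three of four windows land on the diagonal, so `accuracy` returns 0.75. Sweeping `threshold` from 0 to 1 and recording each `precision_recall` pair traces out a curve of the kind shown in Figure 3; averaging the area under those curves across labels gives a mean average precision.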