Towards Environmental Preference Based Speech Enhancement For Individualised Multi-Modal Hearing Aids

Jasper Kirton-Wingate, Shafique Ahmed, Adeel Hussain, Mandar Gogate, Kia Dashtipour, Jen-Cheng Hou, Tassadaq Hussain, Yu Tsao, Amir Hussain

TL;DR

This work addresses the lack of personalised speech enhancement (SE) for hearing aids (HAs) by introducing a preference-learning based SE (PLSE) framework that jointly models environmental signal-to-noise ratio (SNR) and acoustic scene classification (ASC) through a multi-task deep network. A preference-elicitation module tunes the target SNR, $\mathrm{SNR}^*$, of an audio-visual SE (AVSE) system according to context, aiming for contextually individualised noise reduction without sacrificing intelligibility. The approach uses shared representations to couple SNR prediction with ASC, encodes audio with an attention-augmented CNN-BiLSTM, and is evaluated through subjective tests with normal-hearing (NH) and hearing-impaired (HI) listeners on GRID-CHIME3 data, showing preliminary improvements over a non-individualised baseline in some participants. The study lays groundwork for real-world HA deployment but is limited by small participant numbers and simplified ambient-occlusion assumptions, pointing to future work on unseen scenes, multi-speaker scenarios, and clinical integration.

Abstract

Since the advent of Deep Learning (DL), Speech Enhancement (SE) models have performed well under a variety of noise conditions. However, such systems may still introduce sonic artefacts, sound unnatural, and restrict a user's ability to hear ambient sounds that may be important. Hearing Aid (HA) users may wish to customise their SE systems to suit their personal preferences and day-to-day lifestyle. In this paper, we introduce a preference learning based SE (PLSE) model for future multi-modal HAs that can contextually exploit audio information to improve listening comfort, based upon the preferences of the user. The proposed system estimates the signal-to-noise ratio (SNR) as a basic objective speech quality measure, which quantifies the relative amount of background noise present in speech and correlates directly with the intelligibility of the signal. Additionally, to provide contextual information, we predict the acoustic scene in which the user is situated. These tasks are achieved via a multi-task DL model, which surpasses the performance of inferring the acoustic scene or SNR separately by jointly leveraging a shared encoded feature space. These environmental inferences are exploited in a preference elicitation framework, which linearly learns a set of predictive functions to determine the target SNR of an Audio-Visual (AV) SE system. By greatly reducing noise in challenging listening conditions, and by a novel scaling of the output of the SE model, we are able to provide HA users with contextually individualised SE. Preliminary results suggest an improvement over the non-individualised baseline model in some participants.
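
The pipeline described in the abstract (estimate the SNR, map it to a preferred target SNR, then rescale the output) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that time-aligned speech and noise signals are available separately, that "linearly learns" means an ordinary least-squares line fitted per acoustic scene, and that the output is rescaled by applying a gain to the noise component. The function names and the cafe preference values are hypothetical.

```python
import math

def snr_db(speech, noise):
    """SNR in dB from time-aligned speech and noise samples."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b (assumed preference model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def noise_gain_for_target(speech, noise, target_snr_db):
    """Gain g such that speech + g * noise has the requested SNR."""
    return 10.0 ** ((snr_db(speech, noise) - target_snr_db) / 20.0)

# Hypothetical preference data for one scene ("cafe"):
# estimated input SNR (dB) -> target SNR (dB) the listener preferred.
cafe_input_snr = [-5.0, 0.0, 5.0]
cafe_preferred_snr = [5.0, 8.0, 11.0]
slope, intercept = fit_line(cafe_input_snr, cafe_preferred_snr)

# Toy signals: speech power 1.0, noise power 0.01.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]

# Predict the target SNR for this input, then rescale the noise to match it.
target = slope * snr_db(speech, noise) + intercept
gain = noise_gain_for_target(speech, noise, target)
output = [s + gain * n for s, n in zip(speech, noise)]
```

In this sketch one line would be fitted per predicted scene, so the scene classifier selects which `(slope, intercept)` pair is applied to the estimated SNR. A gain above 1 restores ambient sound and a gain below 1 suppresses it further.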

Paper Structure (37 sections, 5 equations, 12 figures, 1 table)

Figures (12)

  • Figure 1: PLSE HA System Diagram with Experimental Dataset Y (GRID-CHIME3)
  • Figure 2: The Multi-Task Acoustic Scene Classification and SNR Prediction Model illustrating a shared encoded feature space.
  • Figure 3: Confusion Matrices for the Single-task (above) and Multi-task (below) Models. The multi-task model achieves near-perfect prediction performance on the held-out set, whilst the single-task model confuses the acoustic scenes that lack clear, long, droning background sounds such as the bus engine. The pedestrian and cafe scenes are often confused by the single-task model, for which Figure 4 shows no clearly separable boundary between classes in the feature representation; this is improved in the multi-task model, as shown in Figure 5.
  • Figure 4: t-SNE Visualisation of Single-task ASC feature embeddings at the self-attention layer level. There is substantial overlap in the feature space between the Cafe, Pedestrian and Street scenes.
  • Figure 5: t-SNE Visualisation of Multi-task ASC feature embeddings at the self-attention layer level. The representation is much better defined than in the single-task model, though the classes are not completely separable.
  • ...and 7 more figures