End-to-end audio-visual learning for cochlear implant sound coding simulations in noisy environments

Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

TL;DR

This work addresses the robustness of cochlear implant sound coding in noisy environments by introducing an end-to-end audio-visual CI framework that fuses lip-reading cues with a differentiable coding network. The AVSE-ECS system combines an audio-visual speech enhancement frontend with ElectrodeNet-CS, using cross-attention to integrate visual information and enabling joint optimization of speech enhancement and electrode pattern generation. The approach yields higher objective intelligibility (STOI/ESTOI/NCM) and substantially improves signal-to-error ratio (SER) compared to ACE and audio-only baselines, with a reported SER gain of about 7.47 dB in joint training. These results highlight the potential of multimodal processing to boost CI performance in noisy settings, while acknowledging the need for subjective testing and hardware-latency considerations in future work.
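As a rough illustration of the cross-attention fusion mentioned above, the following minimal sketch lets each audio frame attend over a sequence of visual (lip) frames. It is not the AVSE-ECS implementation: the choice of audio as queries and visual features as keys and values, the absence of learned projection matrices, the residual addition, and the function names `softmax` and `cross_attention` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio_frames, visual_frames):
    """Fuse visual context into audio features with scaled dot-product attention.

    Assumption: audio frames act as queries; visual frames act as both keys
    and values. Learned projections are omitted to keep the sketch short.
    Both inputs are lists of equal-length feature vectors.
    """
    dim = len(audio_frames[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in audio_frames:
        # Similarity of this audio frame to every visual frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in visual_frames]
        weights = softmax(scores)
        # Attention-weighted average of the visual frames.
        context = [sum(w * frame[d] for w, frame in zip(weights, visual_frames))
                   for d in range(dim)]
        # Residual connection: keep the audio feature and add visual context.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy features (made up): 2 audio frames, 3 visual frames, 2 dimensions each.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(audio, visual)
```

The output has one fused vector per audio frame, so the audio time resolution is preserved even when the visual stream has a different number of frames.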

Abstract

The cochlear implant (CI) is a successful biomedical device that enables individuals with severe-to-profound hearing loss to perceive sound through electrical stimulation, yet listening in noise remains challenging. Recent deep learning advances offer promising potential for CI sound coding by integrating visual cues. In this study, an audio-visual speech enhancement (AVSE) module is integrated with the ElectrodeNet-CS (ECS) model to form the end-to-end CI system, AVSE-ECS. Simulations show that the AVSE-ECS system with joint training achieves high objective speech intelligibility and improves the signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy. These findings underscore the potential of AVSE-based CI sound coding.
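To make the SER figure concrete, here is a minimal sketch of how a signal-to-error ratio in dB can be computed. It assumes the common energy-ratio form, 10·log10(signal energy / error energy); this summary does not give the paper's exact definition, so the formula, the function name `signal_to_error_ratio_db`, and the toy values are assumptions for illustration only.

```python
import math


def signal_to_error_ratio_db(reference, estimate):
    """Return 10*log10(sum(ref^2) / sum((ref - est)^2)) in dB.

    Assumed definition; `reference` and `estimate` are equal-length
    sequences, e.g. flattened electrode patterns.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference has zero energy; SER is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy values (made up, not from the paper).
clean = [1.0, 0.5, 0.0, 0.25]
baseline = [0.5, 0.5, 0.5, 0.25]   # stands in for a baseline strategy
improved = [0.9, 0.5, 0.1, 0.25]   # stands in for an enhanced system

gain_db = (signal_to_error_ratio_db(clean, improved)
           - signal_to_error_ratio_db(clean, baseline))
```

Under this definition, an SER improvement is the difference between two such values computed against the same reference, which is how a gain in dB over a baseline would typically be read.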

Paper Structure

This paper contains 23 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: (a) The architecture of the proposed cochlear implant (CI) system, and the area enclosed by the dashed rectangle indicates the AVSE-ECS network. (b) Overall joint-training architecture for the AVSE-ECS network.
  • Figure 2: Neural network implementations: (a) ECS, (b) ASE-ECS, (c) AVSE-ECS, and (d) cross-attention block. As illustrated in Fig. 1, the color blocks represent the modules of ECS (yellow), NCSN++ (blue), and TCN visual encoder (green).