End-to-end audio-visual learning for cochlear implant sound coding simulations in noisy environments
Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao
TL;DR
This work addresses the robustness of cochlear implant (CI) sound coding in noisy environments by introducing an end-to-end audio-visual CI framework that fuses lip-reading cues with a differentiable coding network. The AVSE-ECS system combines an audio-visual speech enhancement frontend with ElectrodeNet-CS, using cross-attention to integrate visual information and enabling joint optimization of speech enhancement and electrode pattern generation. The approach yields higher objective intelligibility (STOI/ESTOI/NCM) than ACE and audio-only baselines, and joint training improves the signal-to-error ratio (SER) by about 7.47 dB relative to ACE. These results highlight the potential of multimodal processing to boost CI performance in noisy settings; subjective testing and hardware-latency considerations remain for future work.
Abstract
The cochlear implant (CI) is a successful biomedical device that enables individuals with severe-to-profound hearing loss to perceive sound through electrical stimulation, yet listening in noise remains challenging. Recent deep learning advances show promise for improving CI sound coding by integrating visual cues. In this study, an audio-visual speech enhancement (AVSE) module is integrated with the ElectrodeNet-CS (ECS) model to form an end-to-end CI system, AVSE-ECS. Simulations show that the AVSE-ECS system with joint training achieves high objective speech intelligibility and improves the signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy. These findings underscore the potential of AVSE-based CI sound coding.
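To make the SER figure concrete, the following is a minimal sketch of how a signal-to-error ratio in dB is commonly computed. The abstract does not state the exact formula, so the definition used here (reference energy over squared-error energy, in dB) is an assumption, and the function name `signal_to_error_ratio_db` and the toy values are hypothetical illustrations rather than the authors' implementation or data.

```python
import math


def signal_to_error_ratio_db(reference, estimate):
    """Return the SER in dB between a reference sequence and an estimate.

    Assumed definition (not taken from the paper):
        SER = 10 * log10( sum(ref^2) / sum((ref - est)^2) )
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    if error_energy == 0.0:
        return math.inf  # the estimate matches the reference exactly
    if signal_energy == 0.0:
        return -math.inf  # an all-zero reference carries no signal energy
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy example with made-up values: two estimates of the same reference.
reference = [1.0, 1.0, 1.0, 1.0]
coarse_estimate = [0.9, 0.9, 0.9, 0.9]    # error amplitude 0.1
finer_estimate = [0.95, 0.95, 0.95, 0.95]  # error amplitude 0.05

ser_coarse = signal_to_error_ratio_db(reference, coarse_estimate)
ser_finer = signal_to_error_ratio_db(reference, finer_estimate)
ser_gain_db = ser_finer - ser_coarse  # improvement expressed in dB
```

Under this assumed definition, and with the same reference for both systems, an SER gain of G dB corresponds to an error-energy reduction by a factor of 10^(G/10), so a gain of about 7.47 dB would mean the error energy drops by a factor of roughly 5.6.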
