DeepSpeech models show Human-like Performance and Processing of Cochlear Implant Inputs

Cynthia R. Steinhardt, Menoua Keshishian, Nima Mesgarani, Kim Stachenfeld

TL;DR

This paper presents PhoDe, a phoneme-focused DeepSpeech2 model, to simulate how normal-hearing and cochlear-implant listeners process natural and CI-like inputs. By coupling a vocoded CI front-end with a causal LSTM phoneme predictor and aligning predictions with spoken phonemes, the authors reproduce human-like error patterns, reaction-time dynamics, and phoneme confusions under noise. They further show ERP-like, time-locked dynamics across model layers that resemble EEG signatures (N1/P2) during speech processing, suggesting a viable framework to study and optimize CI encoding strategies. The work provides a bridge between biophysical CI models and higher-level speech comprehension, with potential for guiding population-level encoding improvements and extending to other neural implants.

Abstract

Cochlear implants (CIs) are arguably the most successful neural implant, having restored hearing to over one million people worldwide. While CI research has focused on modeling the cochlear activations in response to low-level acoustic features, we hypothesize that the success of these implants is due in large part to the role of the upstream network in extracting useful features from a degraded signal and using learned statistics of language to resolve the signal. In this work, we use the deep neural network (DNN) DeepSpeech2 as a paradigm to investigate how natural inputs and cochlear implant-based inputs are processed over time. We generate naturalistic and cochlear implant-like inputs from spoken sentences and test the similarity of model performance to human performance on analogous phoneme recognition tests. Our model reproduces error patterns in reaction time and phoneme confusion patterns under noise conditions in normal-hearing and CI participant studies. We then use interpretability techniques to determine where and when confusions arise when processing naturalistic and CI-like inputs. We find that dynamics over time in each layer are affected by context as well as input type. Dynamics of all phonemes diverge during confusion and comprehension within the same time window, which is temporally shifted backward in each layer of the network. There is a modulation of this signal during processing of CI inputs, which resembles changes in human EEG signals in the auditory stream. This reduction likely relates to the reduction of encoded phoneme identity. These findings suggest that we have a viable model in which to explore the loss of speech-related information in time, and that we can use it to find population-level encoding signals to target when optimizing cochlear implant inputs to improve encoding of essential speech-related information and improve perception.

Paper Structure (32 sections, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Auditory System Model and Input Generation. A. The Phoneme DeepSpeech2 (PhoDe) network, with 5 LSTM layers followed by a fully connected layer, was trained to process spectrograms of sentences from various speakers in LibriSpeech. B. Cochlear implant versions of inputs were made by running the audio through the front-end processing algorithm of an Advanced Bionics cochlear implant, then transforming the electrodogram via a biophysical model and filterbanks into a vocoded version of the speech. C. During testing, the predicted phoneme sequence was aligned to the true (target) phoneme utterances over time with the Levenshtein algorithm. The number of substitutions (confusions), omissions, and additions could then be counted to build the phoneme confusion matrix. D. Example 3-D projection of activations during confusion (red) and non-confusion (green) of phonemes in network layers: Layer 2 (top) and Layer 11 (bottom).
  • Figure 2: A. Pattern of confusion for human NH consonant and vowel performance at -6 dB (10.1121/1.1810292) versus the low-noise simulation condition. B. Human (10.1055/s-0031-1271950) and simulation comparison of CI listening in quiet for consonants and vowels. Shown at 5, 40, 70, and 90% thresholding compared to normalized maximum prediction probability per phoneme. C. Diagonal correlation, non-diagonal correlation, and KL divergence between human and simulation confusion matrices for the original matrices (blue) versus 500 shuffles of the simulation matrix (red) at each noise level. D. Statistics for CI data.
  • Figure 3: Error Rate Comparison. Error types were Substitution (blue), Addition (yellow), Omission (green), Failed (red), Sub-Om (purple), Sub-Add (teal), Om-Add (grey), and S-O-A (all three, pink). Percent of each error made per word by the simulation (left) versus humans (right) in A. the NH condition and B. the CI condition. C. The percent of correctly identified phonemes at all noise levels by (top) the network and (bottom) human subjects. All comparisons were made to data from 10.7874/jao.2015.19.3.144.
  • Figure 4: Reaction Time Comparison. A. CDF of reaction times for all phonemes for CI (red), NH (blue), confused (dashed), and non-confused (solid). B and D are from 10.1016/j.cognition.2017.08.013. B. Time to fixate on the image of the heard word for NH versus CI subjects. C. Reaction time for confused (left) versus non-confused (right) phonemes in CI (black) and NH (grey) conditions for the simulations with increasing noise level. D. Time to fixate on the image for cohort foils (words with a similar starting phoneme, e.g. wizard/whistle) and rhyme foils for NH and CI subjects. E. CI subject reaction time for certain and uncertain word predictions in quiet (black) and noise (red) from winn2022effortful. F. Reaction time of the model for vowels (red) versus consonants (blue) for non-confused phonemes in quiet (dark) and medium noise (light). G. Reaction time for NH humans for vowel or consonant identification in quiet and noise (4-talker babble) from 10.1016/j.heares.2016.06.001.
  • Figure 5: Differences in Dynamics during Confusion and Non-confusion. A. Model activity can be parsed into time from phoneme onset to phoneme prediction per phoneme. B. Numbered layers in the network as referenced in C-G. C. Layer 11 activations for non-confusion of ‘NG’ (green, left) and confusion with ‘N’ (red). Other phoneme-related activations are shown in various colors with thinner lines. The no-prediction signal (black) dips during phoneme onset (green circle) and phoneme prediction (red circle). D. Raw activation in Layer 11 (left) versus utterance windows interpolated to the same length (40 model time points/400 ms). E. Z-scored distance in PC space between dynamics when phonemes are NC, colored from purple to light green by depth of layer in the model, for NH inputs. F. Distance in PC space between dynamics during processing of NH (top) and CI (bottom) inputs during utterances that were C-P (red), C-NP (blue), NC-NP (green), and NC-P (yellow). G. Change in amplitude and latency of the peak response time with increasing noise (quiet: sand, low: peach, medium: purple) for Layer 2 and Layer 10, which have different average latencies of response. H. ERPs from whole-brain human EEG to words in quiet, stationary noise, or modulated noise in 10.1159/000452123. N1 and P2 times were found at delays of about 130 and 250 ms. I. Amplitude and latency of ERP peak response under quiet and noise conditions for NH and CI listeners.
  • ...and 4 more figures
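
The alignment step described in the Figure 1C caption can be illustrated with a small example. The sketch below is not the authors' code: it is a minimal, generic Levenshtein alignment, assuming unit costs for every edit and phonemes given as plain string labels. The function names `align_phonemes` and `confusion_counts` are hypothetical, introduced only for this illustration. It labels each aligned position as a match, substitution, omission, or addition, then tallies (target, predicted) pairs as the raw counts from which a confusion matrix could be built.

```python
from collections import Counter


def align_phonemes(target, predicted):
    """Align two phoneme sequences with unit-cost Levenshtein distance.

    Returns a list of (operation, target_phoneme, predicted_phoneme) tuples.
    Hypothetical helper for illustration; unit edit costs are an assumption.
    """
    n, m = len(target), len(predicted)
    # dist[i][j] = edit distance between target[:i] and predicted[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if target[i - 1] == predicted[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # omission (target phoneme dropped)
                dist[i][j - 1] + 1,         # addition (extra predicted phoneme)
            )

    # Trace back from the bottom-right corner to recover the edit operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if target[i - 1] == predicted[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                op = "match" if cost == 0 else "substitution"
                ops.append((op, target[i - 1], predicted[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("omission", target[i - 1], None))
            i -= 1
        else:
            ops.append(("addition", None, predicted[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def confusion_counts(ops):
    """Count (target, predicted) pairs over matches and substitutions."""
    return Counter(
        (t, p) for op, t, p in ops if op in ("match", "substitution")
    )


# Example with made-up phoneme labels: 'NG' is confused with 'N'.
ops = align_phonemes(["S", "IH", "NG"], ["S", "IH", "N"])
counts = confusion_counts(ops)
```

Omissions and additions are kept out of the pair counts here and could be tallied separately from `ops`; how the paper treats them in its confusion matrices is not stated in the caption.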