Deep Recurrent Neural Networks for Acoustic Modelling
William Chan, Ian Lane
TL;DR
The paper addresses acoustic modelling for ASR by introducing a deep recurrent architecture that combines a Time Convolutional DNN front-end, a Bidirectional LSTM context model, and a final DNN for classification. This TC-DNN-BLSTM-DNN model achieves a 3.47% WER on WSJ eval92, a relative improvement of more than 8% over the baseline DNN models. It can be trained with standard stochastic gradient descent (SGD), with potential speedups from asynchronous SGD (ASGD). Ablation studies show that each component contributes, and that learned feature processing improves BLSTM performance over raw fMLLR inputs. Overall, the work presents a scalable, high-performing architecture for sequential acoustic modelling in ASR, with implications for faster training through distributed optimization.
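To make the "relative improvement" figure concrete, the following minimal sketch shows how a relative WER reduction is computed, and back-computes the baseline WER that an improvement of exactly 8% would imply for a 3.47 WER. The function names are hypothetical, and the baseline value is derived arithmetic only, not a number reported in this summary.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction: the fraction of the baseline's errors removed."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, rel_improvement):
    """Baseline WER that would give exactly `rel_improvement` for `new_wer`."""
    return new_wer / (1.0 - rel_improvement)


# Hypothetical case: an improvement of exactly 8% relative, ending at 3.47 WER.
baseline = implied_baseline(3.47, 0.08)
print(round(baseline, 2))  # → 3.77
```

An improvement of more than 8% relative therefore corresponds to a baseline WER above roughly 3.77.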
Abstract
We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM) network and a final DNN. The first DNN acts as a feature processor for the model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes that context and models the posterior probabilities of the acoustic states. We achieve a 3.47% Word Error Rate (WER) on the Wall Street Journal (WSJ) eval92 task, a relative improvement of more than 8% over the baseline DNN models.
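The data flow described above can be sketched as a forward pass in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: time convolution is modelled as applying one shared DNN layer to a sliding window of neighbouring frames (with edge frames repeated), each DNN has a single layer, the LSTM has no peephole connections, and all sizes, weights, and function names are hypothetical.

```python
import math
import random


def dense(x, W, b, act):
    """One fully connected layer applied to a single vector: act(W x + b)."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]


def relu(v):
    return max(0.0, v)


def identity(v):
    return v


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def init(rng, n_out, n_in):
    """Small random weights and zero biases (assumed initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def splice(frames, context):
    """Concatenate each frame with its +/- context neighbours, repeating edge frames."""
    T = len(frames)
    out = []
    for t in range(T):
        window = []
        for k in range(-context, context + 1):
            window.extend(frames[min(max(t + k, 0), T - 1)])
        out.append(window)
    return out


def lstm(xs, p, hidden):
    """Standard LSTM without peepholes; returns the hidden state at every step."""
    h, c, outs = [0.0] * hidden, [0.0] * hidden, []
    for x in xs:
        z = x + h  # list concatenation of input and previous hidden state
        i = dense(z, *p["i"], sigmoid)
        f = dense(z, *p["f"], sigmoid)
        o = dense(z, *p["o"], sigmoid)
        g = dense(z, *p["g"], math.tanh)
        c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        outs.append(h)
    return outs


def blstm(xs, fwd, bwd, hidden):
    """Run one LSTM forward and one backward in time, then concatenate per frame."""
    f = lstm(xs, fwd, hidden)
    b = lstm(xs[::-1], bwd, hidden)[::-1]
    return [hf + hb for hf, hb in zip(f, b)]


def make_params(rng, feat_dim, context, proj, hidden, n_states):
    gates = lambda n_in: {g: init(rng, hidden, n_in + hidden) for g in "ifog"}
    return {
        "tc_dnn": init(rng, proj, feat_dim * (2 * context + 1)),
        "fwd": gates(proj),
        "bwd": gates(proj),
        "dnn_hidden": init(rng, proj, 2 * hidden),
        "dnn_out": init(rng, n_states, proj),
    }


def forward(frames, p, context, hidden):
    """TC-DNN -> BLSTM -> DNN: one posterior distribution over states per frame."""
    feats = [dense(x, *p["tc_dnn"], relu) for x in splice(frames, context)]
    ctx = blstm(feats, p["fwd"], p["bwd"], hidden)
    return [softmax(dense(dense(c, *p["dnn_hidden"], relu), *p["dnn_out"], identity))
            for c in ctx]


rng = random.Random(0)
params = make_params(rng, feat_dim=3, context=2, proj=5, hidden=4, n_states=6)
frames = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]  # 7 toy frames
posteriors = forward(frames, params, context=2, hidden=4)
```

Here `posteriors` holds one probability vector over the (toy) acoustic states for each of the 7 input frames. Because the BLSTM reads the whole utterance in both directions, each frame's posterior can depend on the entire sequence rather than only on a fixed window.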
