Deep Recurrent Neural Networks for Acoustic Modelling

William Chan, Ian Lane

TL;DR

The paper introduces a deep recurrent acoustic model for ASR that stacks three components: a DNN with Time Convolution (TC) as a front-end feature processor, a Bidirectional LSTM (BLSTM) that builds context over the sequence, and a final DNN that classifies acoustic states. This TC-DNN-BLSTM-DNN model reaches 3.47% WER on WSJ eval92, more than 8% relative improvement over the baseline DNN models. It trains with standard SGD, and ASGD offers a potential training speedup. Ablation studies show that each component contributes, and that the BLSTM performs better on features learned by the front-end DNN than on raw fMLLR inputs.

Abstract

We present a novel deep Recurrent Neural Network (RNN) model for acoustic modelling in Automatic Speech Recognition (ASR). We term our contribution a TC-DNN-BLSTM-DNN model: it combines a Deep Neural Network (DNN) with Time Convolution (TC), followed by a Bidirectional Long Short-Term Memory (BLSTM), and a final DNN. The first DNN acts as a feature processor for the model, the BLSTM then generates a context from the sequential acoustic signal, and the final DNN takes the context and models the posterior probabilities of the acoustic states. We achieve a WER of 3.47% on the Wall Street Journal (WSJ) eval92 task, more than 8% relative improvement over the baseline DNN models.
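
The relative-improvement claim can be unpacked with a little arithmetic. The sketch below is illustrative only: the abstract does not state the baseline WER, so the code merely derives the lowest baseline consistent with "more than 8% relative improvement" at 3.47% WER. The function names are hypothetical.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer (lower WER is better)."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, rel_improvement):
    """Baseline WER that would give exactly rel_improvement at new_wer."""
    return new_wer / (1.0 - rel_improvement)


model_wer = 3.47  # reported on WSJ eval92
# An improvement of *more than* 8% means the baseline lies above this value.
min_baseline = implied_baseline(model_wer, 0.08)
print(f"baseline WER > {min_baseline:.2f}")
```

In other words, the claim implies a baseline DNN WER somewhat above 3.77%, an absolute gap of roughly 0.3 points.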

Paper Structure

This paper contains 9 sections, 2 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: TC-DNN-BLSTM-DNN architecture. The model has three parts: a signal-processing DNN, which takes the original fMLLR acoustic features and projects them to a high-dimensional space; a BLSTM, which models the sequential signal and produces a context; and a final DNN, which takes the context generated by the BLSTM and estimates the likelihoods across acoustic states.
  • Figure 2: WER convergence of SGD vs. x3 ASGD. Each point represents one epoch of the respective optimizer.
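
The three-part data flow described for Figure 1 can be sketched as a NumPy forward pass. This is a minimal illustration under stated assumptions, not the authors' implementation: time convolution is approximated by stacking each frame with its neighbours in a fixed context window, each DNN is reduced to a single ReLU layer, and every dimension, initialisation, and function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def splice(x, context):
    """Assumed stand-in for time convolution: stack each frame with +/- context neighbours."""
    T = x.shape[0]
    padded = np.pad(x, ((context, context), (0, 0)), mode="edge")
    return np.concatenate([padded[i:i + T] for i in range(2 * context + 1)], axis=1)


def dense(x, w, b):
    """One fully connected ReLU layer."""
    return np.maximum(x @ w + b, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm(x, w, u, b, reverse=False):
    """Single-direction LSTM over a (T, D) sequence; returns (T, H) hidden states."""
    T, H = x.shape[0], u.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        i, f, o, g = np.split(x[t] @ w + h @ u + b, 4)  # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out


def init(shape):
    return rng.normal(0.0, 0.1, size=shape)


def make_params(feat_dim, context, hidden, cell, n_states):
    """Hypothetical parameter layout; all sizes are illustrative."""
    def lstm_params():
        return init((hidden, 4 * cell)), init((cell, 4 * cell)), np.zeros(4 * cell)
    return {
        "front": (init((feat_dim * (2 * context + 1), hidden)), np.zeros(hidden)),
        "fwd": lstm_params(),
        "bwd": lstm_params(),
        "back": (init((2 * cell, hidden)), np.zeros(hidden)),
        "out": (init((hidden, n_states)), np.zeros(n_states)),
    }


def forward(x, p, context):
    """(T, feat_dim) acoustic features -> (T, n_states) per-frame state posteriors."""
    h = dense(splice(x, context), *p["front"])            # front-end feature processor
    ctx = np.concatenate(                                 # BLSTM context: forward + backward
        [lstm(h, *p["fwd"]), lstm(h, *p["bwd"], reverse=True)], axis=1)
    h = dense(ctx, *p["back"])                            # final DNN
    logits = h @ p["out"][0] + p["out"][1]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


params = make_params(feat_dim=40, context=5, hidden=64, cell=32, n_states=100)
posteriors = forward(rng.normal(size=(20, 40)), params, context=5)
print(posteriors.shape)
```

Concatenating the forward and backward LSTM states gives the final DNN a per-frame context that depends on the whole utterance, which is the role the caption assigns to the BLSTM.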