Speech Recognition with Deep Recurrent Neural Networks
Authors
Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton
Abstract
Recurrent neural networks (RNNs) are a powerful model for sequential data.
End-to-end training methods such as Connectionist Temporal Classification make
it possible to train RNNs for sequence labelling problems where the
input-output alignment is unknown. The combination of these methods with the
Long Short-term Memory RNN architecture has proved particularly fruitful,
delivering state-of-the-art results in cursive handwriting recognition. However,
RNN performance in speech recognition has so far been disappointing, with
better results returned by deep feedforward networks. This paper investigates
\emph{deep recurrent neural networks}, which combine the multiple levels of
representation that have proved so effective in deep networks with the flexible
use of long-range context that empowers RNNs. When trained end-to-end with
suitable regularisation, we find that deep Long Short-term Memory RNNs achieve
a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to
our knowledge is the best recorded score.