English Conversational Telephone Speech Recognition by Humans and Machines

George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall

TL;DR

This paper investigates how close automated conversational speech recognition can get to human performance by benchmarking human transcription accuracy and advancing acoustic and language-modeling techniques. It employs LSTM-based and ResNet-based acoustic models, enhanced through speaker-adversarial multi-task learning and feature fusion, along with time-dilated CNNs for dense sequence prediction. On the language side, it introduces several LSTM and WaveNet-style models and demonstrates substantial WER reductions via extensive LM rescoring and model fusion, achieving 5.5%/10.3% WER on Switchboard/CallHome. Contrary to some prior claims of human parity, the study argues that parity has not been reached, especially for CallHome, and emphasizes dataset characteristics and human transcription quality in shaping realistic targets for future work.

Abstract

One of the most difficult speech recognition tasks is accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This raises two issues: what is human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which, at least at the time of writing, is a new performance milestone (albeit not at what we measure to be human performance). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning, and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.
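
Every figure quoted in the abstract is a word error rate (WER), so it may help to see how that metric is usually computed: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch under that standard definition. The function name `word_error_rate` and the example sentences are illustrative assumptions, and the text normalization applied by official scoring tools is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.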

Paper Structure

This paper contains 8 sections, 1 equation, 3 figures, 9 tables.

Figures (3)

  • Figure 1: LSTM with speaker-adversarial MTL architecture.
  • Figure 2: Residual connections on sequences. The convolutions are unpadded and reduce the size of the feature maps in the time direction (indicated with red dashed lines). To match this reduction, we simply crop the edges along the time on the shortcut connection.
  • Figure 3:
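
The caption of Figure 2 describes a residual connection in which unpadded convolutions shorten the sequence in time and the shortcut is cropped to match. The sketch below illustrates that idea on a single-channel sequence in plain Python. It is a simplified illustration, not the paper's implementation: the function names are hypothetical, the crop is assumed to be symmetric, and channels, nonlinearities, and normalization are omitted.

```python
def valid_conv1d(x, kernel, dilation=1):
    """Unpadded 1-D cross-correlation; output is shorter by dilation * (len(kernel) - 1)."""
    span = dilation * (len(kernel) - 1)
    return [
        sum(w * x[t + i * dilation] for i, w in enumerate(kernel))
        for t in range(len(x) - span)
    ]


def crop_time(x, total):
    """Drop `total` frames from the edges, split as evenly as possible (assumed symmetric)."""
    left = total // 2
    right = total - left
    return x[left:len(x) - right]


def residual_block(x, kernels, dilation=1):
    """Apply unpadded convolutions in sequence, then add the time-cropped input."""
    y = x
    for kernel in kernels:
        y = valid_conv1d(y, kernel, dilation)
    # The convolution path lost len(x) - len(y) frames; crop the shortcut to match.
    shortcut = crop_time(x, len(x) - len(y))
    return [a + b for a, b in zip(shortcut, y)]


# Hypothetical input: 8 frames, two size-3 kernels that pass the centre frame through.
frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
output = residual_block(frames, [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
```

Each size-3 kernel removes two frames at dilation 1, so the eight input frames shrink to four, and the shortcut loses two frames on each side before the addition. With a dilation of 2, a single size-3 kernel removes four frames, which is why dilated layers need a correspondingly larger crop.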