The CAPIO 2017 Conversational Speech Recognition System

Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim, Ian Lane

TL;DR

The paper targets state-of-the-art conversational speech recognition on the NIST 2000 Hub5 English benchmark by introducing densely connected LSTM architectures (dense TDNN-LSTM and dense CNN-bLSTM) and a simple acoustic model adaptation via parameter averaging. It demonstrates substantial WER reductions, achieving 5.0% on Switchboard and 9.1% on CallHome, with a human-parity claim in the Switchboard domain. Beyond telephony, the approach generalizes to non-telephony data (TED-LIUM, LibriSpeech) with strong results, aided by diverse language models and a lattice-based system fusion. The work underscores the value of deep, densely connected architectures, lightweight adaptation strategies, and multi-system fusion for robust, high-performance conversational speech recognition.

Abstract

In this paper we show how we achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We explore densely connected LSTMs, inspired by the densely connected convolutional networks recently introduced for image classification tasks. We also propose an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version. This method was applied with the CallHome training corpus and improved individual system performance by 6.1% (relative) on average on the CallHome portion of the evaluation set, with no performance loss on the Switchboard portion. With RNN-LM rescoring and lattice combination of the 5 systems trained across three different phone sets, our 2017 speech recognition system obtained word error rates of 5.0% and 9.1% on Switchboard and CallHome, respectively, both of which are the best word error rates reported thus far. According to IBM's latest work comparing human and machine transcriptions, our reported Switchboard word error rate can be considered to surpass human parity (5.1%) in transcribing conversational telephone speech.
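The adaptation scheme described in the abstract, averaging the parameters of a seed model and its adapted version, can be illustrated with a minimal sketch. The names `average_parameters`, `seed_model`, `adapted_model` and the `weight` argument are hypothetical and do not come from the paper, and the models are represented as plain dictionaries of parameter lists rather than a real acoustic model. With `weight=0.5` the function computes a plain element-wise average; other weights are shown only as a possible generalization.

```python
from typing import Dict, List

# A model is represented here as parameter name -> flat list of values.
# This layout is an assumption made for illustration only.
Params = Dict[str, List[float]]


def average_parameters(seed: Params, adapted: Params, weight: float = 0.5) -> Params:
    """Interpolate two parameter sets element-wise.

    weight is the share given to the adapted model; 0.5 is a plain average.
    """
    if seed.keys() != adapted.keys():
        raise ValueError("models must share the same parameter names")
    averaged: Params = {}
    for name, seed_values in seed.items():
        adapted_values = adapted[name]
        if len(seed_values) != len(adapted_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        averaged[name] = [
            (1.0 - weight) * s + weight * a
            for s, a in zip(seed_values, adapted_values)
        ]
    return averaged


# Toy values, not taken from any trained model.
seed_model = {"layer1.weight": [0.2, -0.4], "layer1.bias": [0.0]}
adapted_model = {"layer1.weight": [0.6, 0.0], "layer1.bias": [1.0]}
combined = average_parameters(seed_model, adapted_model)
```

Averaging in parameter space only makes sense when both models share one architecture and the adapted model was initialized from the seed, which is the setting the abstract describes.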

Paper Structure

This paper contains 13 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: WER comparison between residual and dense connections for LSTMs with a cell dimension of 128.
  • Figure 2: Structure of a dense TDNN-LSTM acoustic model. Each dense block outputs 1,024-dimensional non-linear activation vectors.
  • Figure 3: Structure of a dense CNN-bLSTM acoustic model. Each dense block has 256-dimensional non-linear activation vectors.
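
The difference between the residual and dense connections compared in Figure 1 can be sketched as follows. This assumes the DenseNet-style definition, in which each layer receives the concatenation of the original input and all earlier layer outputs, whereas a residual connection adds a layer's output to its input. Whether the paper's models follow exactly this wiring is an assumption. `toy_layer` is a hypothetical stand-in for an LSTM layer, used only to make the dimensions visible.

```python
import math
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def toy_layer(out_dim: int) -> Layer:
    """Hypothetical stand-in for an LSTM layer: any-size input -> out_dim values."""
    def apply(x: Vector) -> Vector:
        total = sum(x)
        return [math.tanh(total / (i + 1)) for i in range(out_dim)]
    return apply


def residual_stack(x: Vector, layers: List[Layer]) -> Vector:
    """Residual connection: each layer's output is added to its input."""
    for layer in layers:
        y = layer(x)
        if len(y) != len(x):
            raise ValueError("residual addition needs matching dimensions")
        x = [a + b for a, b in zip(x, y)]
    return x


def dense_stack(x: Vector, layers: List[Layer]) -> Vector:
    """Dense connection: each layer sees the input plus all earlier outputs."""
    features = list(x)
    for layer in layers:
        # Concatenate rather than add, so the feature vector grows per layer.
        features = features + layer(features)
    return features


inputs = [0.1, -0.2, 0.3, 0.4]
layers = [toy_layer(4) for _ in range(3)]
residual_out = residual_stack(inputs, layers)
dense_out = dense_stack(inputs, layers)
```

In this sketch the residual output keeps the input dimension, while the dense output grows by one layer's output size per layer, so later layers have direct access to every earlier representation.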