The Microsoft 2016 Conversational Speech Recognition System

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

TL;DR

The paper addresses conversational telephone speech (CTS) recognition on Switchboard by combining CNN and LSTM acoustic models with i-vector speaker adaptation, lattice-free MMI (LF-MMI) sequence training, and RNNLM rescoring. The acoustic models span three CNN variants (VGG-like, ResNet, and LACE) and a bidirectional LSTM; i-vector adaptation and LF-MMI training yield gains for every architecture. Language model rescoring uses forward and backward RNNLMs trained on mixed-domain data alongside a large N-gram LM, and the individual systems are merged by confusion-network-based system combination. The best single system reaches a WER of 6.9% and the combined system 6.2% on the NIST 2000 Switchboard task, an improvement over previously reported results on this benchmark.

Abstract

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.
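
The results above are reported as word error rate (WER). As a reminder of how that metric is conventionally defined, here is a minimal Python sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the paper's scoring pipeline, and official NIST scoring applies text normalization that is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level Levenshtein distance / reference length.

    Assumes plain whitespace tokenization (hypothetical simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy example (not from the paper): one substitution and one deletion
    # against a five-word reference.
    print(word_error_rate("i think that is right", "i think this right"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.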

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 5 tables.

Figures (1)

  • Figure 1: LACE network architecture