U2++: Unified Two-pass Bidirectional End-to-end Model for Speech Recognition

Di Wu, Binbin Zhang, Chao Yang, Zhendong Peng, Wenjing Xia, Xiaoyu Chen, Xin Lei

TL;DR

U2++ targets accurate streaming and non-streaming ASR within a single unified framework. It extends U2 with bidirectional (forward and backward) decoders trained jointly with CTC and AED objectives, and adds SpecSub data augmentation; together these improve convergence and robustness. Experiments cover AISHELL-1 and AISHELL-2. On AISHELL-1, U2++ reaches 4.63% CER in the non-streaming setup and 5.05% CER in the streaming setup, which the authors report as the best published streaming result on that test set. Decoding uses a practical two-pass rescoring strategy with flexible decoding and decoder allocation, combining the forward and backward predictions.

Abstract

The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use the forward and backward information of the labeling sequences simultaneously during training to learn richer information, and to combine the forward and backward predictions at decoding to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to make the U2++ model more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster in training, is more robust to the decoding method, and yields a consistent 5%-8% word error rate reduction over U2. In the AISHELL-1 experiments, U2++ achieves a 4.63% character error rate (CER) with a non-streaming setup and 5.05% with a streaming setup at 320 ms latency. To the best of our knowledge, 5.05% is the best published streaming result on the AISHELL-1 test set.
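
To make the idea of combining forward and backward predictions at decoding concrete, here is a minimal Python sketch of second-pass rescoring over a list of first-pass candidates. The weighted interpolation below, the names `Hypothesis`, `combined_score`, and `rescore`, and the weights `ctc_weight` and `reverse_weight` are all assumptions made for illustration. The abstract does not state the exact scoring rule, so this is one plausible form rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One first-pass candidate with hypothetical log-probability scores."""
    text: str
    ctc_score: float  # log-probability from the first (CTC) pass
    l2r_score: float  # log-probability from the forward (left-to-right) decoder
    r2l_score: float  # log-probability from the backward (right-to-left) decoder


def combined_score(hyp: Hypothesis,
                   ctc_weight: float = 0.5,
                   reverse_weight: float = 0.3) -> float:
    """Interpolate the two decoder scores, then add the weighted CTC score.

    Assumed form: (1 - reverse_weight) * l2r + reverse_weight * r2l
                  + ctc_weight * ctc
    """
    attention = ((1.0 - reverse_weight) * hyp.l2r_score
                 + reverse_weight * hyp.r2l_score)
    return attention + ctc_weight * hyp.ctc_score


def rescore(hyps: List[Hypothesis],
            ctc_weight: float = 0.5,
            reverse_weight: float = 0.3) -> Hypothesis:
    """Return the candidate with the highest combined score."""
    return max(hyps, key=lambda h: combined_score(h, ctc_weight, reverse_weight))


# Made-up scores: the forward decoder prefers "a", the backward decoder "b".
candidates = [
    Hypothesis("a", ctc_score=-1.0, l2r_score=-2.0, r2l_score=-6.0),
    Hypothesis("b", ctc_score=-1.5, l2r_score=-2.5, r2l_score=-2.0),
]
best_bidirectional = rescore(candidates)                      # uses both decoders
best_forward_only = rescore(candidates, reverse_weight=0.0)   # ignores backward decoder
```

With these made-up scores, setting `reverse_weight` to zero reduces the rule to forward-only rescoring and the selected candidate changes, which shows how a backward decoder can overturn a first-pass preference.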

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Two-pass CTC and AED joint architecture
  • Figure 2: Left attention mask and right attention mask
  • Figure 3: The loss comparison of U2++ and U2