Lightweight Transducer Based on Frame-Level Criterion

Genshun Wan, Mengzhi Wang, Tingzhi Mao, Hang Chen, Zhongfu Ye

TL;DR

To address the imbalanced classification caused by excessive blanks in the labels, the blank and non-blank probabilities are decoupled, and the gradient of the blank classifier is truncated so that it does not propagate to the main network.

Abstract

The transducer model trained with a sequence-level criterion requires a lot of memory because it generates a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the transducer. This significantly reduces memory and computation requirements. To address the imbalanced classification caused by excessive blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier so that it does not propagate to the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to those of the transducer. Additionally, we use richer information to predict the blank probability, achieving results superior to those of the transducer.
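
To make the memory argument concrete, the sketch below counts how many encoder/decoder pairs must be scored under each scheme. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`emission_frames`, `decoder_index_per_frame`, `pair_counts`), the blank index `0`, and the rule that a token is emitted at the first frame of each non-blank run are all hypothetical choices made here.

```python
BLANK = 0  # assumed blank index; the paper's convention may differ


def emission_frames(alignment):
    """Frames at which a CTC-style path starts a new non-blank token."""
    frames = []
    prev = BLANK
    for t, label in enumerate(alignment):
        if label != BLANK and label != prev:
            frames.append(t)
        prev = label
    return frames


def decoder_index_per_frame(alignment):
    """For each frame, the number of tokens emitted strictly before it.

    Under the assumption made here, this is the index of the decoder
    state that would be paired with the encoder output at that frame.
    """
    starts = set(emission_frames(alignment))
    indices, count = [], 0
    for t in range(len(alignment)):
        indices.append(count)
        if t in starts:
            count += 1
    return indices


def pair_counts(num_frames, num_tokens):
    """Encoder/decoder pairs scored by each scheme (illustrative only)."""
    full_lattice = num_frames * (num_tokens + 1)  # every (t, u) node
    frame_level = num_frames + num_tokens         # one pair per frame, one per token
    return full_lattice, frame_level


alignment = [0, 3, 3, 0, 5, 0]  # toy per-frame labels, 0 = blank
print(emission_frames(alignment))
print(decoder_index_per_frame(alignment))
print(pair_counts(len(alignment), len(emission_frames(alignment))))
```

For this toy alignment the full lattice has 18 nodes while the frame-level pairing touches 8 pairs; the gap grows with the product of the number of frames and the number of tokens.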

Paper Structure (15 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Output probability lattice defined by $\Pr(k \mid t,u)$. The node at $(t, u)$ represents the probability of having output the first $u$ elements of the output sequence by point $t$ in the transcription sequence. The horizontal arrow leaving node $(t, u)$ represents the probability $\varnothing(t,u)$ of outputting nothing at $(t, u)$; the vertical arrow represents the probability $y(t,u)$ of outputting the element $u + 1$ of $y$.
  • Figure 2: The structure of the lightweight transducer. The encoder output $h$ is transformed to $h'$ according to the CTC forced alignment result $a$, and then combined with the decoder output $g$ and fed into the joint network. The decoder output $g$ is likewise transformed to $g'$ according to $a$, and then combined with $h$ and fed into the blank network.
  • Figure 3: Python code for the CTC forced alignment algorithm.
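
The code of Figure 3 is not reproduced on this page. As a stand-in, here is a minimal sketch of generic CTC forced alignment by Viterbi search over the blank-extended label sequence. It is not the authors' code: the function name `ctc_forced_align`, the nested-list input format, and the default blank index `0` are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, targets, blank=0):
    """Return the most likely per-frame label path that collapses to `targets`.

    log_probs: list of T rows, each a list of V log-probabilities.
    targets:   list of label indices (no blanks).
    """
    num_frames = len(log_probs)
    # Extended sequence: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for y in targets:
        ext += [y, blank]
    num_states = len(ext)

    score = [[NEG_INF] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_probs[0][ext[0]]
    if num_states > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # when the two surrounding labels differ.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == NEG_INF:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best

    # A valid path ends in the final blank or the final label.
    s = num_states - 1
    if num_states > 1 and score[-1][num_states - 2] > score[-1][s]:
        s = num_states - 2
    if score[-1][s] == NEG_INF:
        raise ValueError("not enough frames to align the target sequence")

    path = [blank] * num_frames
    for t in range(num_frames - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


def _row(favourite, vocab_size=3):
    """Toy log-probability row: 0.8 on one label, 0.1 on the others."""
    return [math.log(0.8 if k == favourite else 0.1) for k in range(vocab_size)]


toy_log_probs = [_row(1), _row(0), _row(2), _row(0)]
print(ctc_forced_align(toy_log_probs, [1, 2]))
```

The returned list has one label per frame, which is the kind of frame-level label sequence a frame-level criterion needs; how repeated labels in the path are treated afterwards is left open here.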