A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system

Sunil Kumar Kopparapu, Ashish Panda

TL;DR

The paper addresses how to select the sub-word vocabulary size for tokenizers used in end-to-end ASR. It treats the tokenizer as a black box and minimizes a cost function over the number of tokens $n$. Given a text corpus ${\cal S}$ and a tokenizer ${\cal T}$, it defines the cost $C = \alpha_1 n + \alpha_2 \left(\frac{f^+}{f^-} - 1\right) + \alpha_3 \left(\frac{\theta_t}{w} - 1\right)$ and takes the optimal vocabulary size $n^*$ to be the value of $n$ that minimizes $C$. Using LibriSpeech-100h and a conformer encoder-decoder in ESPNet, it compares SentencePiece-Unigram and SentencePiece-BPE across values of $n$ and reports the best results at $n^*=145$ (Unigram) and $n^*=70$ (BPE) with equal weights $(\alpha_1, \alpha_2, \alpha_3) = (1, 1, 1)$, outperforming the ESPNet baseline of $n=300$. This illustrates that tokenization choices substantially influence both ASR performance and compute cost. The work concludes that the cost-function approach can identify vocabulary sizes that improve accuracy while reducing computational demands, and it points to future refinement of the weights and the inclusion of additional factors.
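To make the selection step concrete, the following is a minimal Python sketch of evaluating the cost $C$ over a set of candidate vocabulary sizes and picking the minimizer $n^*$. It is an illustration under stated assumptions, not the authors' implementation: this summary does not define $f^+$, $f^-$, $\theta_t$, or $w$, so the sketch treats them as opaque numbers that have already been measured for each $n$ (for example, by running the black-box tokenizer on the corpus). The names `TokenizerStats`, `cost`, and `optimal_vocab_size` are hypothetical, and the sketch applies the formula exactly as written above, without any normalization the paper may use.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TokenizerStats:
    """Hypothetical container for the quantities in C at one vocabulary size."""
    n: int          # vocabulary size (number of tokens)
    f_plus: float   # f^+ in the cost (not defined in this summary)
    f_minus: float  # f^- in the cost (not defined in this summary)
    theta_t: float  # theta_t in the cost (not defined in this summary)
    w: float        # w in the cost (not defined in this summary)


def cost(s: TokenizerStats, alphas: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """C = a1*n + a2*(f+/f- - 1) + a3*(theta_t/w - 1), as written in the TL;DR."""
    a1, a2, a3 = alphas
    term1 = a1 * s.n
    term2 = a2 * (s.f_plus / s.f_minus - 1.0)
    term3 = a3 * (s.theta_t / s.w - 1.0)
    return term1 + term2 + term3


def optimal_vocab_size(
    stats: Iterable[TokenizerStats],
    alphas: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> int:
    """Return the candidate n with the smallest cost (a plain grid search)."""
    best = min(stats, key=lambda s: cost(s, alphas))
    return best.n


if __name__ == "__main__":
    # Made-up numbers purely to exercise the code; they are not from the paper.
    candidates = [
        TokenizerStats(n=50, f_plus=400.0, f_minus=2.0, theta_t=300.0, w=100.0),
        TokenizerStats(n=100, f_plus=100.0, f_minus=2.0, theta_t=200.0, w=100.0),
        TokenizerStats(n=200, f_plus=60.0, f_minus=2.0, theta_t=150.0, w=100.0),
    ]
    print(optimal_vocab_size(candidates))
```

Because the tokenizer is treated as a black box, a grid search over candidate $n$ is the simplest strategy; changing `alphas` shifts the trade-off among the three terms.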

Abstract

Unlike hybrid speech recognition systems, where tokens were restricted to phones, biphones, or triphones, end-to-end ASR systems derive their tokens from the text corpus of the training data. Tokenization algorithms such as Byte Pair Encoding (BPE) and WordPiece are popular for identifying the tokens used in training the speech recognition system. Popular toolkits like ESPNet use a pre-defined vocabulary size (number of tokens) for these tokenization algorithms, but there is no discussion of how that vocabulary size was derived. In this paper, we build a cost function, treating the tokenization process as a black box, to enable choosing the number of tokens that might most benefit building an end-to-end ASR system. We show through experiments on the LibriSpeech 100-hour set that the performance of an end-to-end ASR system improves when the number of tokens is chosen carefully.

Paper Structure

This paper contains 5 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The terms $t_1, t_2, t_3$ of the cost function $C$ for $n=30$ to $1000$.
  • Figure 2: SentencePiece-Unigram. $n^*$ is marked with a red "*". The x-axis shows the number of tokens and the y-axis shows the cost $C$.
  • Figure 3: SentencePiece-BPE. $n^*$ is marked with a red "*". The x-axis shows the number of tokens and the y-axis shows the cost $C$.
  • Figure 4: Optimal number of tokens ($n^*$) for $(\alpha_1, \alpha_2, \alpha_3) = (1, 1, 1)$.