Exploring the Benefits of Tokenization of Discrete Acoustic Units

Avihu Dekel, Raul Fernandez

TL;DR

This work investigates applying Byte Pair Encoding to compress inventories of phonemes and Discrete Acoustic Units (DAUs) to improve efficiency and accuracy in three tasks: grapheme-to-phoneme (G2P), grapheme-to-DAU (G2DAU), and SpeechLM-based audio generation. By deriving variable-rate token vocabularies and analyzing the trade-off between sequence length and vocabulary size, the authors demonstrate consistent improvements in performance (e.g., WER/CER, BLEU/ROUGE) and substantial speedups in training and inference across all tasks. They introduce the normalized entropy metric to quantify token balance and provide theoretical insights connecting sequence length, token imbalance, and autoregressive accuracy. The findings advocate for broader adoption of tokenization for acoustic/phonetic representations and point to future work integrating BPE-derived tokens directly into end-to-end TTS systems while managing the challenges of variable-rate tokens. These contributions have practical impact for speech processing pipelines by enabling faster, more robust learning with discrete acoustic representations.
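The summary above mentions a normalized entropy metric for token balance but does not give its formula. As a rough illustration only, the sketch below assumes the common definition: the Shannon entropy of the empirical token distribution divided by the log of the vocabulary size, so that 1.0 means perfectly balanced usage. The function name `normalized_entropy` and the optional `vocab_size` parameter are hypothetical, and the paper's exact definition may differ.

```python
import math
from collections import Counter


def normalized_entropy(tokens, vocab_size=None):
    """Entropy of the empirical token distribution, scaled to [0, 1].

    ASSUMPTION: normalized entropy = H(p) / log(|V|). This is a common
    definition, not necessarily the one used in the paper.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    # If no vocabulary size is given, fall back to the number of observed types.
    k = vocab_size if vocab_size is not None else len(counts)
    if total == 0 or k <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)
```

Under this assumed definition, a sequence in which every token type occurs equally often scores 1.0, while a sequence dominated by a single type scores closer to 0.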

Abstract

Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
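To make the idea of merging base units into larger, variable-rate units concrete, here is a minimal sketch of textbook Byte Pair Encoding applied to sequences of integer unit IDs (standing in for phonemes or DAUs). It is not the authors' implementation: the names `learn_bpe` and `merge_pair`, the tuple representation of tokens, and the stopping rule (stop when the most frequent pair occurs only once) are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Greedy BPE over sequences of base units (hypothetical helper).

    Each token is a tuple of base units, so a merged token is simply the
    concatenation of its two parts and decoding is a flatten.
    """
    seqs = [[(u,) for u in s] for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing worth merging
            break
        merges.append((left, right))
        seqs = [merge_pair(s, (left, right), left + right) for s in seqs]
    return merges, seqs


def decode(tokens):
    """Flatten merged tokens back into the original base-unit sequence."""
    return [u for tok in tokens for u in tok]
```

Each merge adds one entry to the vocabulary and shortens the encoded sequences, which is the sequence-length versus vocabulary-size trade-off the summary refers to.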

Paper Structure

This paper contains 16 sections, 5 equations, 1 figure, 5 tables, 1 algorithm.

Figures (1)

  • Figure 1: Summary of the benefits of BPE on DAUs/phonemes across the three tasks. The experimental setup is described in the experimental setup section.