STAB: Speech Tokenizer Assessment Benchmark

Shikhar Vashishth; Harman Singh; Shikhar Bharadwaj; Sriram Ganapathy; Chulayuth Asawaroengchai; Kartik Audhkhasi; Andrew Rosenberg; Ankur Bapna; Bhuvana Ramabhadran

TL;DR

STAB (Speech Tokenizer Assessment Benchmark) is a systematic evaluation framework that assesses speech tokenizers comprehensively and characterizes their inherent properties, giving a deeper understanding of the underlying mechanisms of speech tokenization.

Abstract

Representing speech as discrete tokens transforms it into a format that closely resembles text, enabling the use of speech as an input to the widely successful large language models (LLMs). Although several speech tokenizers have been proposed, it remains unclear which properties a tokenizer needs for specific downstream tasks and how well it generalizes overall. Evaluating tokenizers across different downstream tasks is computationally intensive and difficult to scale. To circumvent this requirement, we present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively and shed light on their inherent characteristics. This framework provides a deeper understanding of the underlying mechanisms of speech tokenization, offering a resource for accelerating the development of future tokenizer models and enabling comparative analysis on a standardized benchmark. We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.

Paper Structure (14 sections, 3 figures, 2 tables)

Introduction
Related Work
STAB Details
Invariance
Robustness
Compressibility
Vocabulary
Experimental Setup
Datasets
Baseline systems
Results
STAB Performance Comparison
Correlation with Downstream tasks
Conclusion

Figures (3)

Figure 1: STAB's invariance dimensions. (Left) Speaker invariance: the tokenization of the same sentence spoken by two different speakers is compared. (Right) Context invariance: the tokenization of an initial segment of the speech signal is compared with and without the original context available. Refer to Section \ref{sec:details} for more details.
Figure 2: Vocabulary distribution for the USM-v1 and USM-v2 tokenizers. The inclusion of ASR loss allows the tokenizer to capture language relatedness. Refer to Section \ref{sec:stab_comparison} for details.
Figure 3: Correlation plot showing the relationship between the STAB metrics (Table \ref{tbl:stab_main}) and downstream task performance (Table \ref{tbl:downstream_tasks}). Pairs of tokenizers are considered, and the correlation is computed between the relative improvements in STAB metrics and the relative improvements in task performance.
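
The pairwise analysis described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes that "relative improvement" means the fractional change of one tokenizer's score over another's and that the correlation is Pearson's; the caption specifies neither. The function names, the dictionary layout, and the scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def relative_improvement(new, old):
    """Fractional change of `new` with respect to `old` (assumed definition)."""
    return (new - old) / abs(old)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlation(metric, task):
    """Correlate relative improvements over all unordered tokenizer pairs.

    `metric` and `task` map a tokenizer name to a score (hypothetical layout).
    """
    pairs = list(combinations(sorted(metric), 2))
    metric_gain = [relative_improvement(metric[b], metric[a]) for a, b in pairs]
    task_gain = [relative_improvement(task[b], task[a]) for a, b in pairs]
    return pearson(metric_gain, task_gain)


# Made-up scores for three hypothetical tokenizers, for illustration only.
metric_scores = {"tok_a": 0.50, "tok_b": 0.60, "tok_c": 0.75}
task_scores = {"tok_a": 40.0, "tok_b": 48.0, "tok_c": 60.0}
corr = pairwise_correlation(metric_scores, task_scores)
print(round(corr, 3))
```

Working on pairs rather than raw scores removes differences in scale between metrics and tasks, so a benchmark metric and a task measured in unrelated units can still be compared. For tasks where lower is better (for example, an error rate), the sign of the task improvement would need to be flipped before correlating.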
