BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Anup Singh, Vipul Arora, Kris Demuynck

TL;DR

This work proposes a noise and reverberation-augmented training strategy to improve tokenizer robustness and introduces optimal transport-based regularization to ensure balanced token usage and enhance token efficiency.

Abstract

Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
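
To make the retrieval step concrete, the following is a minimal sketch of TF-IDF search over discrete token sequences. It is not the authors' implementation: the abstract does not give the details of the search mechanism, so the use of token bigrams as "terms", the smoothed IDF formula, the cosine ranking, and all function names (`ngrams`, `build_index`, `search`) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(database, n=2):
    """Build one sparse TF-IDF vector (a dict) per tokenized utterance.

    Assumption: each utterance is a list of integer token IDs and the
    "terms" are token n-grams (bigrams by default).
    """
    counts = [Counter(ngrams(seq, n)) for seq in database]
    num_docs = len(database)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each n-gram counted once per utterance
    # Smoothed IDF: n-grams found in every utterance keep a positive weight.
    idf = {g: math.log((1 + num_docs) / (1 + df)) + 1.0
           for g, df in doc_freq.items()}
    vectors = [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, vectors, idf, n=2):
    """Rank utterances by similarity to the query; returns (index, score) pairs."""
    q_counts = Counter(ngrams(query, n))
    # n-grams never seen in the database carry no weight and are dropped.
    q_vec = {g: tf * idf[g] for g, tf in q_counts.items() if g in idf}
    scored = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy database of three tokenized utterances (hypothetical token IDs).
database = [[3, 7, 7, 1, 4], [5, 5, 2, 9, 0], [6, 7, 1, 4]]
vectors, idf = build_index(database)
ranking = search([7, 1, 4], vectors, idf)
```

In this toy example, the utterances that share bigrams with the query score above the one that shares none. A production system would typically store the vectors in an inverted index so that only utterances sharing at least one n-gram with the query are scored.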

Paper Structure

This paper contains 16 sections, 10 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Illustration of our self-supervised learning framework for robust speech tokenization.