Frontend Token Enhancement for Token-Based Speech Recognition

Takanori Ashihara; Shota Horiguchi; Kohei Matsuura; Tsubasa Ochiai; Marc Delcroix

TL;DR

This paper tackles noise robustness in token-based speech recognition by introducing four modular frontend enhancements (W2W-E, T2T-E, V2T-E, and W2T-E) that are trained independently of the ASR backend. In CHiME-4 experiments, wave-to-token enhancement (W2T-E) performs best among the frontends and mostly outperforms an ASR system built on continuous SSL features, while T2T-E, V2T-E, and W2W-E provide insights into token-domain denoising and efficiency. A key finding is that token-level robustness does not always correlate with token sequence similarity (UED), underscoring the need to optimize directly for WER rather than for proxy metrics. Overall, the work demonstrates a practical, backend-agnostic path to robust token-based speech processing with lightweight frontends and suggests joint optimization with token ASR and extension to broader speech tasks as future directions.
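The gap between UED and WER can be illustrated with a small sketch. It assumes that both metrics are edit distances normalized by reference length, computed over tokens for UED and over words for WER; this summary does not give the exact definitions. The token IDs, the transcripts, and the function names `edit_distance` and `error_rate` are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (tokens or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical data: enhanced tokens differ from the clean tokens,
# yet the (hypothetical) backend transcript is still correct.
clean_tokens = [12, 12, 7, 45, 45, 3]
enhanced_tokens = [12, 7, 7, 45, 9, 3]
reference_words = "the cat sat".split()
hypothesis_words = "the cat sat".split()

ued = error_rate(clean_tokens, enhanced_tokens)
wer = error_rate(reference_words, hypothesis_words)
print(f"UED = {ued:.3f}, WER = {wer:.3f}")
```

In this toy case the token sequences disagree while the transcript is unchanged, which is the kind of mismatch that makes UED an imperfect proxy for WER.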

Abstract

Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
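As a rough illustration of why clustering-based tokens are sensitive to noise, the sketch below assigns each feature frame to its nearest centroid, which is one common way to derive tokens from clustered SSL features. The two-dimensional features, the three centroids, and the names `nearest_centroid` and `tokenize` are hypothetical stand-ins, not the configuration used in the paper.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to one feature frame."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))


def tokenize(features, centroids):
    """Map a sequence of feature frames to a sequence of token IDs."""
    return [nearest_centroid(frame, centroids) for frame in features]


# Hypothetical codebook with three centroids in a 2-D feature space.
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical clean features and a noisy version in which the second
# frame has drifted across a cluster boundary.
clean_features = [[0.1, -0.1], [0.9, 1.1], [-0.8, 0.9]]
noisy_features = [[0.1, -0.1], [-0.05, 1.1], [-0.8, 0.9]]

clean_tokens = tokenize(clean_features, centroids)
noisy_tokens = tokenize(noisy_features, centroids)
print(clean_tokens, noisy_tokens)
```

A frontend in the sense of this paper would aim to predict `clean_tokens` given only the noisy input, whether that input is a waveform, continuous SSL features, or noisy tokens.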

Paper Structure (14 sections, 3 figures, 2 tables)

This paper contains 14 sections, 3 figures, 2 tables.

Introduction
Related work
Frontend enhancement for token ASR
Token-to-token enhancement (T2T-E)
Vector-to-token enhancement (V2T-E)
Wave-to-token enhancement (W2T-E)
Experimental Setup
Evaluation metrics and datasets
Models and training
Results
Performance comparison
Relationship between UED and WER
Impact of SSL model depth on W2T-E
Conclusion

Figures (3)

Figure 1: Schematic illustration of the token ASR backend (top) and the four categories of enhancement frontends (bottom) to improve backend noise robustness.
Figure 2: Utterance counts by UED-WER change group (et_simu).
Figure 3: WERs (bars) and UEDs (black lines) as a function of SSL model depth in W2T-E. L0 denotes the output of the convolutional feature encoder of WavLM Large.
