Frontend Token Enhancement for Token-Based Speech Recognition
Takanori Ashihara, Shota Horiguchi, Kohei Matsuura, Tsubasa Ochiai, Marc Delcroix
TL;DR
This paper addresses noise robustness in token-based speech recognition with four modular frontend enhancement models that are trained independently of the ASR backend: wave-to-wave (W2W-E), token-to-token (T2T-E), continuous SSL features-to-token (V2T-E), and wave-to-token (W2T-E). In CHiME-4 experiments, W2T-E is the strongest frontend and mostly outperforms an ASR system built on continuous SSL features, while T2T-E, V2T-E, and W2W-E serve as points of comparison for token-domain denoising and efficiency. A key finding is that token-level robustness does not always correlate with token sequence similarity (UED), which suggests optimizing directly for WER rather than for proxy metrics. Overall, the work demonstrates a practical, backend-agnostic path to robust token-based speech processing with lightweight frontends, and it points to joint optimization with token ASR and to broader speech tasks as future directions.
Abstract
Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
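To make the notion of token degradation concrete, the following is a minimal sketch of how noise can change a token sequence, not the authors' implementation. It assumes that tokens are obtained by assigning each feature frame to its nearest cluster centroid, and it measures the mismatch between clean and noisy token sequences with a length-normalized Levenshtein distance, used here as a stand-in for a token-similarity metric such as UED (the paper's exact definition may differ). The 2-D features, the three centroids, and the function names `quantize`, `edit_distance`, and `token_error_rate` are all hypothetical; real systems would use SSL features and centroids learned by clustering.

```python
import numpy as np


def quantize(features, centroids):
    """Map each frame (row of `features`) to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).tolist()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]


def token_error_rate(ref, hyp):
    """Edit distance normalized by the reference length (assumed normalization)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Toy stand-ins for cluster centroids and SSL feature frames (hypothetical values).
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
clean = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.8], [0.1, 0.9]])
noise = np.array([[0.0, 0.0], [-0.6, 0.0], [0.0, 0.0], [0.0, 0.0]])
noisy = clean + noise  # the second frame is pushed toward a different centroid

clean_tokens = quantize(clean, centroids)
noisy_tokens = quantize(noisy, centroids)
rate = token_error_rate(clean_tokens, noisy_tokens)
```

In this toy case a perturbation of a single frame flips one of four tokens. A frontend in the sense of the abstract would aim to recover `clean_tokens` from the noisy input before the tokens reach the ASR backend.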
