Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR
Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie
TL;DR
The paper tackles Thai ASR in data- and compute-constrained settings by integrating a strong speech encoder with a Thai LLM through a two-stage training scheme. It introduces a self-evolving data refinement loop that improves weak labels and yields 16k hours of refined Thai data, then couples an enhanced Zipformer encoder with Typhoon2-Llama3 backbones via adapters and LoRA to achieve state-of-the-art CER. A pluggable cosine-based sequence compression module reduces sequence length by up to $50\%$ during training and inference with modest CER degradation ($<5\%$), achieving $1.5\times$ to $2.1\times$ speedups. Cross-dataset experiments on Gigaspeech2, MSR-86k, CommonVoice Thai, and FLEURS show robust gains and practical acceleration, supporting the approach for low-resource Thai ASR. The work contributes openly released refined labels, a scalable LLM-integrated ASR architecture, and a flexible compression module that enables efficient deployment.
Abstract
Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: scarcity of high-quality data and high computational demands. This paper proposes EThai-ASR, the first system to apply large language models (LLMs) to Thai ASR, with the goal of building an efficient LLM-based ASR system. EThai-ASR comprises a speech encoder, a connection module, and a Thai LLM decoder. To address data scarcity and obtain a powerful speech encoder, EThai-ASR introduces a self-evolving data refinement strategy that refines weak labels, yielding an enhanced speech encoder. Moreover, we propose a pluggable sequence compression module, used within the connection module, with three modes designed to reduce the sequence length, thus decreasing computational demands while maintaining decent performance. Extensive experiments demonstrate that EThai-ASR achieves state-of-the-art accuracy on multiple datasets. We release our refined text transcripts to promote further research.
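To make the idea of cosine-based sequence compression concrete, the following is a minimal sketch of one plausible variant: consecutive encoder frames whose cosine similarity meets a threshold are merged by averaging. It is an illustration under stated assumptions, not the paper's implementation. The function names (`cosine`, `compress_frames`), the adjacent-frame comparison, the averaging rule, and the default threshold of 0.9 are all hypothetical, and the three modes mentioned in the abstract are not reproduced here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def _mean(group: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty group of equal-length vectors."""
    n = len(group)
    return [sum(column) / n for column in zip(*group)]


def compress_frames(
    frames: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[float]]:
    """Merge runs of adjacent frames that are similar to their predecessor.

    Assumption (hypothetical rule): a frame joins the current run when its
    cosine similarity to the immediately preceding frame is >= threshold;
    each run is then replaced by the mean of its frames.
    """
    if not frames:
        return []
    merged: List[List[float]] = []
    group: List[Sequence[float]] = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) >= threshold:
            group.append(cur)            # similar to predecessor: extend the run
        else:
            merged.append(_mean(group))  # close the run and start a new one
            group = [cur]
    merged.append(_mean(group))
    return merged


if __name__ == "__main__":
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    compressed = compress_frames(toy, threshold=0.9)
    print(len(toy), "->", len(compressed), "frames")
```

In this toy input, the four frames form two runs of identical vectors, so the sequence is halved. In a real system the achieved compression ratio would depend on the threshold and on how redundant neighboring encoder frames are; a shorter sequence reduces the number of tokens the LLM decoder must attend over, which is where the speedup would come from.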
