Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR

Mingchen Shao, Xinfa Zhu, Chengyou Wang, Bingshen Mu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie

TL;DR

The paper tackles Thai ASR in data- and compute-constrained settings by integrating a strong speech encoder with a Thai LLM through a two-stage training scheme. It introduces a self-evolving data refinement loop that improves weak labels and yields 16k hours of refined Thai data, then couples an enhanced Zipformer encoder with Typhoon2-Llama3 backbones via adapters and LoRA to reach state-of-the-art CER. A pluggable sequence compression module based on cosine similarity shortens sequences by up to $50\%$ in both inference and training with modest CER loss ($<5\%$), achieving $1.5\times$ to $2.1\times$ speedups. Cross-dataset experiments on Gigaspeech2, MSR-86k, CommonVoice Thai, and FLEURS show robust gains and practical acceleration, validating the approach for low-resource Thai ASR. The work contributes openly released refined transcripts, a scalable LLM-integrated ASR architecture, and a flexible compression module that enables efficient deployment.

Abstract

Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: scarcity of high-quality data and high computational demands. This paper proposes EThai-ASR, the first system to apply large language models (LLMs) to Thai ASR, and builds an efficient LLM-based ASR system around it. EThai-ASR comprises a speech encoder, a connection module, and a Thai LLM decoder. To address data scarcity and obtain a powerful speech encoder, EThai-ASR introduces a self-evolving data refinement strategy that refines weak labels, yielding an enhanced speech encoder. Moreover, we propose a pluggable sequence compression module, used in the connection module, with three modes designed to reduce the sequence length, thus decreasing computational demands while maintaining decent performance. Extensive experiments demonstrate that EThai-ASR achieves state-of-the-art accuracy on multiple datasets. We release our refined text transcripts to promote further research.

Paper Structure

This paper contains 15 sections, 2 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: The architecture of EThai-ASR. The left panel illustrates how redundant speech frames are dynamically removed via cosine similarity to reduce sequence length before LLM processing. The right panel illustrates three integration modes: in from-scratch training, the Transformer adapter, linear projector, and LLM LoRA are finetuned after redundancy removal; in efficient finetuning, only the linear projector is adjusted, giving comparable performance with reduced overhead; in inference, redundancy removal is applied without any parameter training.
  • Figure 2: Impact of cosine similarity thresholds on SR and CER retention. The red line represents the variation in SR, while the blue line illustrates the changes in CER retention.
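
The redundancy removal described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule of comparing each frame against the most recently kept frame, dropping (rather than merging) similar frames, and the threshold of 0.9 are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def compress_frames(
    frames: Sequence[Sequence[float]], threshold: float = 0.9
) -> Tuple[List[int], List[Sequence[float]]]:
    """Drop frames that are nearly identical to the last kept frame.

    frames: one feature vector per encoder output frame.
    threshold: hypothetical cosine-similarity cutoff (an assumption, not a
        value from the paper); a frame is dropped when its similarity to the
        last kept frame is at or above this value.
    Returns the kept indices and the kept frames.
    """
    if not frames:
        return [], []

    kept = [0]  # always keep the first frame
    for t in range(1, len(frames)):
        if cosine_similarity(frames[t], frames[kept[-1]]) < threshold:
            kept.append(t)
    return kept, [frames[i] for i in kept]


if __name__ == "__main__":
    # Toy 2-dimensional "frames": 1 repeats 0 and 3 repeats 2 in direction.
    toy = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
    indices, compressed = compress_frames(toy, threshold=0.9)
    print(indices, len(compressed))
```

In this sketch, lowering the threshold removes more frames and shortens the sequence further, at the risk of discarding useful information; Figure 2 examines how the threshold affects that trade-off in the paper's own setup.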