CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese

Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. Fung

TL;DR

CantoASR tackles low-resource Cantonese ASR by integrating acoustic prosody with LALM reasoning through a four-stage pipeline: forced-alignment-based acoustic feature extraction, a LoRA-finetuned Whisper for tone-sensitive ASR, tonal instruction tuning that maps acoustic cues to phonological rules, and Qwen2-Audio-based prosody-aware correction with constrained decoding and semantic validation. The approach directly links $F_0$, slope, and duration to tonal categories, enabling robust tone disambiguation and accent adaptation without manual tonal labels. Across CV, MCE, and MDCC, CantoASR achieves a best average CER of 11.19%, outperforming strong audio-conditioned baselines (e.g., Qwen2-Audio-7B at 14.5%). Ablations show complementary gains from ASR finetuning, tonal instruction tuning, and Cantonese semantic validation. This work demonstrates a scalable strategy for low-resource tonal languages and provides resources to support reproducibility and extension to related languages such as Hokkien and Vietnamese, with potential impact on accessibility and dialectal NLP.
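To make the three prosody cues concrete, the following is a minimal sketch of how mean $F_0$, slope, and duration could be summarised for one force-aligned segment. It is not the authors' implementation: the function name `segment_prosody`, the `ProsodyFeatures` container, the convention that unvoiced frames carry an $F_0$ of 0, and the use of a least-squares line for the slope are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProsodyFeatures:
    mean_f0: float   # Hz, averaged over voiced frames only
    slope: float     # Hz per second, least-squares fit over voiced frames
    duration: float  # seconds, taken from the alignment boundaries


def segment_prosody(times: Sequence[float], f0: Sequence[float],
                    start: float, end: float) -> ProsodyFeatures:
    """Summarise the F0 contour of one aligned segment [start, end).

    Assumption: unvoiced frames are marked with f0 <= 0 and are skipped.
    """
    pts = [(t, f) for t, f in zip(times, f0) if start <= t < end and f > 0]
    duration = end - start
    if not pts:
        # No voiced frames: report a flat, empty contour.
        return ProsodyFeatures(0.0, 0.0, duration)

    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in pts)
    cov = sum((t - mean_t) * (f - mean_f) for t, f in pts)
    # A single voiced frame has no time spread, so its slope is taken as 0.
    slope = cov / var_t if var_t > 0 else 0.0
    return ProsodyFeatures(mean_f, slope, duration)


if __name__ == "__main__":
    # Hypothetical rising contour sampled every 10 ms; the last frame is unvoiced.
    times = [0.00, 0.01, 0.02, 0.03, 0.04]
    f0 = [200.0, 210.0, 220.0, 230.0, 0.0]
    print(segment_prosody(times, f0, start=0.0, end=0.05))
```

In a sketch like this, a clearly positive slope would point towards a rising tone and a near-zero slope towards a level tone; how these values map to Cantonese tone categories is determined by the tonal instruction tuning described above, not by this code.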

Abstract

Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
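For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below follows that standard definition; the function name `cer` is illustrative, and any text normalisation the paper applies before scoring (punctuation, script variants) is not reproduced here.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted character out of five gives a CER of 0.2.
    print(cer("我哋去食飯", "我地去食飯"))
```

Scoring at the character level suits Cantonese because written text has no whitespace word boundaries, so a word-level metric would depend on the choice of segmenter.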

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of CantoASR. The pipeline integrates prosody cues (F0, slope, duration) into LALM ASR error correction to build a tonal instruction-tuning dataset, and leverages ASR error patterns to build a Cantonese correction dataset.