CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
Dazhong Chen, Yi-Cheng Lin, Yuchen Huang, Ziwei Gong, Di Jiang, Zeying Xie, Yi R. Fung
TL;DR
CantoASR tackles low-resource Cantonese ASR by integrating acoustic prosody with LALM reasoning through a four-stage pipeline: (1) forced-alignment-based acoustic feature extraction, (2) a LoRA-finetuned Whisper for tone-sensitive ASR, (3) tonal instruction tuning that maps acoustic cues to phonological rules, and (4) Qwen2-Audio-based prosody-aware correction with constrained decoding and semantic validation. The approach directly links $F_0$, slope, and duration to tonal categories, enabling robust tone disambiguation and accent adaptation without manual tonal labels. Across CV, MCE, and MDCC, CantoASR achieves a best average CER of $11.19\%$, outperforming strong audio-conditioned baselines (e.g., Qwen2-Audio-7B at $14.5\%$); ablations show complementary gains from ASR finetuning, tonal instruction tuning, and Cantonese semantic validation. The work demonstrates a scalable strategy for low-resource tonal languages and provides resources to support reproducibility and extension to related languages such as Hokkien and Vietnamese, with potential impact on accessibility and dialectal NLP.
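To make the acoustic cues named above concrete, the following is a minimal sketch of how mean $F_0$, slope, and duration could be computed for one force-aligned syllable. It is not the authors' implementation: the function name `syllable_prosody`, the fixed frame hop, the convention that unvoiced frames are stored as 0, and the least-squares line fit for the slope are all assumptions made for illustration.

```python
import numpy as np


def syllable_prosody(f0_hz, frame_s, start_s, end_s):
    """Mean F0 (Hz), F0 slope (Hz/s), and duration (s) for one aligned syllable.

    f0_hz   : per-frame F0 track for the whole utterance (hypothetical input)
    frame_s : hop between frames in seconds (assumed constant)
    start_s, end_s : syllable boundaries, e.g. from a forced aligner
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    lo = max(0, int(round(start_s / frame_s)))
    hi = min(len(f0_hz), int(round(end_s / frame_s)))
    seg = f0_hz[lo:hi]
    t = np.arange(lo, hi) * frame_s

    # Assumption: unvoiced frames are encoded as 0 and excluded from the fit.
    voiced = seg > 0
    duration = end_s - start_s
    if voiced.sum() < 2:
        # Too few voiced frames to estimate a contour.
        return {"mean_f0": float("nan"), "slope": float("nan"), "duration": duration}

    # Least-squares line through the voiced frames; the first coefficient is the slope.
    slope, _ = np.polyfit(t[voiced], seg[voiced], 1)
    return {
        "mean_f0": float(seg[voiced].mean()),
        "slope": float(slope),
        "duration": duration,
    }


# Synthetic rising contour: 200 Hz climbing at 100 Hz/s, sampled every 10 ms.
contour = [200 + 100 * (i * 0.01) for i in range(30)]
features = syllable_prosody(contour, 0.01, 0.0, 0.2)
```

In this sketch a clearly positive slope would indicate a rising contour and a near-zero slope a level one; how such features are mapped to Cantonese tone categories in CantoASR is left to the instruction-tuning stage described in the paper.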
Abstract
Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
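Since the reported gains are in character error rate (CER), a short sketch of the metric may help: CER is the character-level edit distance between hypothesis and reference, divided by the reference length. The helper names `edit_distance` and `cer`, the whitespace stripping, and the example sentences are illustrative assumptions; the text normalization used in the paper's evaluation is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    # Assumption: whitespace is not counted as a character.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: two tone-confusable substitutions in a 7-character reference.
score = cer("我哋今日去飲茶", "我地今日去飲查")  # 2 / 7
```

Because each Cantonese character corresponds to one syllable, a single tone confusion typically surfaces as one character substitution, which is why CER is a natural fit for this setting.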
