ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark

Tung X. Nguyen, Nhu Vo, Giang-Son Nguyen, Duy Mai Hoang, Chien Dinh Huynh, Inigo Jauregi Unanue, Massimo Piccardi, Wray Buntine, Dung D. Le

TL;DR

ViMedCSS introduces the first public benchmark for Vietnamese medical code-switching ASR, addressing the challenge of English medical terms embedded in Vietnamese speech. The authors create a 34.57-hour corpus with 16,576 utterances across five medical topics, including a hard split for unseen terms, and evaluate both zero-shot baselines and targeted fine-tuning strategies. They demonstrate a clear trade-off: Vietnamese-specific models excel on overall speech, while multilingual models better capture embedded English terms; the strongest performance arises from parameter-efficient adaptation with language-identity guidance on a Vietnamese-optimized backbone. This work provides a practical, domain-focused dataset and actionable insights for improving CS term recognition in low-resource, multilingual ASR systems, with broad implications for medical transcription and multilingual clinical workflows.

Abstract

Code-switching (CS), in which Vietnamese speech incorporates English words such as drug names or procedure names, is common in Vietnamese medical communication. This poses challenges for Automatic Speech Recognition (ASR) systems, especially for low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine fine-tuning strategies aimed at improving medical term recognition, in order to identify the most effective approach for this dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.
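The abstract contrasts overall accuracy with accuracy on the code-switched terms but does not name the metrics here. The following is a minimal sketch of how such a split could be measured, assuming word error rate (WER) for the whole utterance and a simple recall score for the English terms. The function names, the single-word term matching, and the example sentence are all illustrative assumptions and are not taken from the paper or the dataset.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Overall word error rate: edit distance divided by reference length."""
    ref_words = ref.lower().split()
    hyp_words = hyp.lower().split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def term_recall(terms: List[str], hyp: str) -> float:
    """Fraction of single-word English terms found verbatim in the hypothesis.

    Assumption: each term is a single token; multi-word terms would need
    phrase matching instead of a set lookup.
    """
    if not terms:
        raise ValueError("terms must not be empty")
    hyp_words = set(hyp.lower().split())
    hits = sum(1 for term in terms if term.lower() in hyp_words)
    return hits / len(terms)


# Made-up example (not from ViMedCSS): the model spells the English drug
# name out as Vietnamese syllables instead of transcribing the term itself.
reference = "bệnh nhân được chỉ định dùng paracetamol sau phẫu thuật"
hypothesis = "bệnh nhân được chỉ định dùng para xê ta môn sau phẫu thuật"

overall_wer = wer(reference, hypothesis)
cs_recall = term_recall(["paracetamol"], hypothesis)
print(f"overall WER: {overall_wer:.2f}, CS term recall: {cs_recall:.2f}")
```

In this made-up case every Vietnamese word is transcribed correctly, yet the one English term is missed entirely. A single aggregate score would hide that failure, which is why the two measurements are worth reporting separately.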

Paper Structure

This paper contains 15 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Dataset construction pipeline for Vietnamese medical code-switching.
  • Figure 2: Histogram of utterance durations.