CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units

Yeeun Kang

TL;DR

This paper introduces CoVoSwitch, a synthetic code-switching dataset built from the CoVoST 2 speech-to-text translation corpus by detecting English intonation units with PSST and replacing them with non-English segments. It evaluates two multilingual MT models, M2M-100 418M and NLLB-200 600M, on this dataset, which spans 13 languages, comparing code-switched translation against monolingual baselines and against the raw code-switched inputs. Including code-switched units improves translation over monolingual settings, and both models translate into English (csw→En) better than into non-English targets (csw→X). Low-resource languages gain the most in csw→En but much less in csw→X, where translations into low-resource languages can score below the raw code-switched input and show off-target output and hallucinated words. By releasing CoVoSwitch, the work broadens language representation for code-switching research and shows both the potential and the limits of prosodically informed code-switch synthesis for MT.

Abstract

Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI's Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into non-English. Translations into low-resource languages also perform worse than even raw code-switched inputs. We find that systems excel at copying English tokens but struggle with non-English tokens, that the off-target problem in monolingual settings is also relevant in code-switching settings, and that models hallucinate in code-switching translation by introducing words absent in both of the original source sentences. CoVoSwitch and code are available at https://github.com/sophiayk20/covoswitch.
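
The replacement step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the English sentence has already been split into intonation units (for example by a segmenter such as PSST) and that each unit already has an aligned non-English segment, or `None` where no alignment exists. The function name, the `replace_ratio` parameter, and the unit-level definition of the replacement rate are all hypothetical choices made for illustration.

```python
import random


def synthesize_code_switched(english_units, aligned_units, replace_ratio=0.5, seed=0):
    """Swap a random subset of English intonation units for aligned segments.

    english_units: list of str, the English intonation units in order.
    aligned_units: list of (str or None), same length; None marks a unit
        with no aligned non-English segment (assumed input, not computed here).
    Returns (code_switched_text, replacement_rate), where the rate is the
    fraction of units replaced (a unit-level definition chosen for this sketch).
    """
    if len(english_units) != len(aligned_units):
        raise ValueError("english_units and aligned_units must have equal length")

    rng = random.Random(seed)
    # Only units that actually have an aligned segment can be replaced.
    candidates = [i for i, seg in enumerate(aligned_units) if seg]
    k = round(len(candidates) * replace_ratio)
    chosen = set(rng.sample(candidates, k))

    mixed = [
        aligned_units[i] if i in chosen else english_units[i]
        for i in range(len(english_units))
    ]
    rate = len(chosen) / len(english_units) if english_units else 0.0
    return " ".join(mixed), rate


if __name__ == "__main__":
    # Placeholder strings stand in for real aligned segments.
    eng = ["the weather was nice", "so we went", "to the beach"]
    ali = ["[X-unit-0]", None, "[X-unit-2]"]
    print(synthesize_code_switched(eng, ali, replace_ratio=1.0))
```

With `replace_ratio=1.0` every unit that has an aligned segment is swapped, so the unaligned middle unit stays in English. A fixed seed keeps the sampled subset reproducible across runs.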

Paper Structure (17 sections, 6 figures, 12 tables)

This paper contains 17 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Our code-switching data generation pipeline with an example of English and Catalan parallel corpora.
  • Figure 2: Example translation output in Catalan-English and Welsh-English for csw$\rightarrow$En task.
  • Figure 3: Replacement rates plotted against spBLEU deltas. Correlation $\rho$ in the upper right corner is measured with Spearman's coefficient.
  • Figure 4: Repeated words in csw$\rightarrow$X.
  • Figure 5: Off-target problem, changed meaning, and repeated combinations of characters in csw$\rightarrow$X.
  • ...and 1 more figure
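
The Figure 3 caption mentions Spearman's coefficient between replacement rates and spBLEU deltas. As a rough sketch of how such a rank correlation is computed, the block below ranks both series (averaging ranks over ties) and takes the Pearson correlation of the ranks. The function names are hypothetical and the numbers in the usage example are invented placeholders, not results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented placeholder values, one pair per hypothetical language.
    replacement_rates = [0.21, 0.35, 0.28, 0.40]
    spbleu_deltas = [1.5, 3.0, 2.0, 2.8]
    print(spearman_rho(replacement_rates, spbleu_deltas))
```

Because only ranks enter the calculation, any strictly monotonic relationship yields a coefficient of 1 or -1, regardless of whether it is linear.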