The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR
Injy Hamed, Ngoc Thang Vu, Nizar Habash
TL;DR
This study examines code-switched data augmentation by evaluating a wide range of techniques (lexical replacements, linguistic theories, and back-translation) across three downstream tasks: MT, ASR, and cascaded ST. It extends previous MT-focused findings to ASR and ST to test whether they generalize and whether the effect of synthetic data quality depends on the task. The results show that back-translation and predictive lexical replacements yield the most consistent gains across tasks. Gains in MT correlate strongly with human-perceived naturalness, whereas gains in ASR do not show the same correlation. The work highlights that data diversity and task complexity strongly influence outcomes, and it suggests future exploration of large language model-based CSW generation and personalization for improved robustness.
Abstract
Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that must be addressed to build user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, which motivates research on code-switched data augmentation. However, the current literature lacks comprehensive studies of the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research in this direction, conducted on machine translation (MT), with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test the generalizability of its findings. Our experiments cover a wide range of augmentation techniques, including lexical replacements, linguistic theories, and back-translation. Based on the MT, ASR, and ST results, we draw conclusions and insights regarding the efficacy of the various augmentation techniques and the impact of synthetic data quality on performance.
