CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving

Bhavani Shankar, Preethi Jyothi, Pushpak Bhattacharyya

TL;DR

CoSTA tackles code-switched spoken translation by bootstrapping from pretrained ASR and MT models and introducing an aligned interleaving strategy to fuse speech and transcription representations before decoding in English. Training uses a tri-task objective with ST, ASR, and MT losses, enabling end-to-end optimization on synthetically generated code-switched ST data. The authors release new code-switched evaluation sets for Bengali-English, Hindi-English, Marathi-English, and Telugu-English, plus podcast and monolingual benchmarks, and demonstrate BLEU gains up to 3.5 points over strong baselines while showing robustness to varying degrees of code-switching. Across extensive ablations, the aligned interleaving and mean-pooling fusion emerge as key drivers of performance, with CoSTA generalizing across cross-domain data such as Kathbath. The work advances practical code-switched ST and provides valuable benchmarks for future research in multilingual speech translation.
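The tri-task objective mentioned above can be written, as a sketch, as a weighted sum of the three losses. The weights $\lambda_{\text{ASR}}$ and $\lambda_{\text{MT}}$ are hypothetical symbols introduced here for illustration; the summary does not state how the losses are combined, and the paper's own equation may differ.

```latex
% Illustrative only: the weighting scheme is an assumption, not taken from the paper.
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{ST}}
  + \lambda_{\text{ASR}} \, \mathcal{L}_{\text{ASR}}
  + \lambda_{\text{MT}} \, \mathcal{L}_{\text{MT}}
```

Setting both weights to 1 gives a plain unweighted sum, which is the simplest reading of "tri-task objective".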

Abstract

Code-switching is a widely prevalent linguistic phenomenon in multilingual societies like India. Building speech-to-text models for code-switched speech is challenging due to limited availability of datasets. In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture CoSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules (that are more widely available for many languages). Speech and ASR text representations are fused using an aligned interleaving scheme and are fed further as input to a pretrained MT module; the whole pipeline is then trained end-to-end for spoken translation using synthetically created ST data. We also release a new evaluation benchmark for code-switched Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to English text. CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
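To make the aligned interleaving idea concrete, here is a minimal sketch in plain Python. It assumes that a frame-to-token alignment is already available (for example from the ASR module), that the speech frames aligned to each token are mean-pooled into one vector, and that the pooled speech vectors and text embeddings are then interleaved as speech, text, speech, text, and so on. The function names, the `-1` convention for unaligned frames, and the zero vector used for tokens with no frames are all assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Vector], dim: int) -> Vector:
    """Average a group of frame vectors; return zeros if the group is empty (assumed fallback)."""
    if not frames:
        return [0.0] * dim
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def aligned_interleave(
    speech_frames: Sequence[Vector],
    frame_to_token: Sequence[int],
    text_embeddings: Sequence[Vector],
) -> List[Vector]:
    """Interleave one pooled speech vector with each text embedding.

    frame_to_token[i] is the index of the text token that frame i is aligned to,
    or -1 if the frame is aligned to no token (hypothetical convention).
    The result has 2 * len(text_embeddings) vectors: [s_0, t_0, s_1, t_1, ...].
    """
    dim = len(text_embeddings[0]) if text_embeddings else 0
    interleaved: List[Vector] = []
    for token_index, text_vector in enumerate(text_embeddings):
        # Collect the frames aligned to this token and pool them into one vector.
        aligned = [
            frame
            for frame, target in zip(speech_frames, frame_to_token)
            if target == token_index
        ]
        interleaved.append(mean_pool(aligned, dim))
        interleaved.append(list(text_vector))
    return interleaved


# Toy example: four 2-dimensional speech frames, two text tokens.
frames = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [0.0, 0.0]]
alignment = [0, 0, 1, -1]  # the last frame is aligned to no token
tokens = [[10.0, 10.0], [20.0, 20.0]]
fused = aligned_interleave(frames, alignment, tokens)
```

In an actual model the interleaved sequence would be a tensor passed to the text encoder; lists are used here only to keep the sketch dependency-free.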

Paper Structure (33 sections, 1 equation, 3 figures, 17 tables)

Figures (3)

  • Figure 1: Model with aligned interleaving, which aligns corresponding speech and text embeddings and interleaves them before passing them through the text encoder (IndicTrans encoder here).
  • Figure 2: To assess the accuracy of code-switched span translation, we evaluate the exact match between the English spans in the reference translation and the predicted translation. This involves identifying the English spans in the code-switched transcript and then comparing these spans with those in the predicted translations. It is important to note that this process is order-dependent.
  • Figure 3: Example generated outputs from the best Hindi cascaded model (IndicWav2Vec for ASR combined with IndicTrans for MT, fine-tuned), the best Seamless model (Seamless fine-tuned ASR+ST), and CoSTA. Note that error propagation is observed in the cascaded model (highlighted in red), arising from multiple factors: an incorrect transcript in the first example, the English word ready-made being incorrectly transcribed by the Hindi ASR model in the second example, and a machine translation error in the third example. Additionally, the English words uttered in the speech are correctly captured by CoSTA (highlighted in blue), unlike in the cascaded and Seamless models.
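
The order-dependent span matching described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: it assumes English words in the code-switched transcript are written in Latin script, treats each maximal run of such words as one span, and counts a span as matched only if it appears verbatim in the prediction after the previously matched span. The function names and the punctuation handling are hypothetical.

```python
import re
from typing import List

PUNCTUATION = ".,!?;:\"'()"


def english_spans(transcript: str) -> List[List[str]]:
    """Return maximal runs of Latin-script words (assumed to be the English spans)."""
    spans: List[List[str]] = []
    current: List[str] = []
    for token in transcript.split():
        word = token.strip(PUNCTUATION).lower()
        if word and re.fullmatch(r"[a-z][a-z'-]*", word):
            current.append(word)
        else:
            # A non-Latin token (or bare punctuation) closes the current span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


def ordered_span_match(transcript: str, prediction: str) -> float:
    """Fraction of English spans found verbatim in the prediction, in transcript order."""
    spans = english_spans(transcript)
    if not spans:
        return 0.0
    predicted = [token.strip(PUNCTUATION).lower() for token in prediction.split()]
    cursor = 0  # the search for each span starts after the previous match
    matched = 0
    for span in spans:
        width = len(span)
        for start in range(cursor, len(predicted) - width + 1):
            if predicted[start:start + width] == span:
                matched += 1
                cursor = start + width
                break
    return matched / len(spans)


transcript = "मैंने ready-made कपड़े खरीदे for the party"
in_order = ordered_span_match(transcript, "I bought ready-made clothes for the party")
reordered = ordered_span_match("office में meeting है", "The meeting is in the office")
```

The second call shows the order dependence: both English words appear in the prediction, but because "meeting" precedes "office" there, only the first span is counted once the cursor has moved past "office".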