Zero-resource Speech Translation and Recognition with LLMs
Karel Mundnich, Xing Niu, Prashant Mathur, Srikanth Ronanki, Brady Houston, Veera Raghavendra Elluru, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff
TL;DR
The paper tackles zero-resource speech translation (ST) and automatic speech recognition (ASR), covering languages for which the model has seen no paired audio-text data, by combining a pretrained multilingual speech encoder, a multilingual LLM, and a lightweight CNN adapter that maps audio representations into the LLM's token embedding space. It compares multi-task and single-task training, scales across model sizes (mT0-XL/XXL), and reports BLEU scores over 23 on CoVoST2 for two previously unseen languages in ST and WERs of up to 28.2% in ASR, identifying the LLM's language generation ability as a key bottleneck. Using several datasets (CoVoST2, Europarl-ST, Common Voice, VoxPopuli, FLEURS, LibriVox-MLS) and ablations on CNN pretraining and LoRA adaptation, the work shows that carefully trained adapters, model scale, and data diversity drive zero-resource transfer, and that language confusion and output-language accuracy significantly affect results. The findings suggest that zero-resource speech technology is practical in multilingual settings, within limits set by how reliably the LLM generates text in the desired language.
Abstract
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and which data have the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. Finally, we show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
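To make the role of the adaptation module concrete, the following is a minimal sketch of one way such a module could work: a single strided 1D convolution over time that shortens the sequence of speech-encoder frames and projects each window to the LLM's embedding width. It is an illustration under stated assumptions, not the architecture used in this work. The function name `conv1d_adapter`, the kernel size, the stride, and all dimensions are hypothetical toy values, and the weights are random rather than trained.

```python
import random

def conv1d_adapter(frames, weights, bias, stride):
    """Strided 1D convolution over time (hypothetical adapter sketch).

    frames:  list of T vectors of length d_in (speech encoder outputs)
    weights: kernel indexed as weights[k][d_in][d_out]
    bias:    list of length d_out
    Returns floor((T - k) / stride) + 1 vectors of length d_out.
    """
    k = len(weights)
    d_in = len(weights[0])
    d_out = len(bias)
    out = []
    # Slide a window of k frames over the sequence, moving `stride` frames per step.
    for start in range(0, len(frames) - k + 1, stride):
        vec = list(bias)
        for j in range(k):
            frame = frames[start + j]
            for i in range(d_in):
                x = frame[i]
                row = weights[j][i]
                for o in range(d_out):
                    vec[o] += x * row[o]
        out.append(vec)
    return out

# Toy sizes, chosen only for illustration (assumptions, not the paper's values).
T, D_AUDIO, D_LLM, KERNEL, STRIDE = 10, 4, 6, 3, 2

random.seed(0)
encoder_frames = [[random.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(T)]
kernel = [[[random.gauss(0, 0.1) for _ in range(D_LLM)]
           for _ in range(D_AUDIO)]
          for _ in range(KERNEL)]
bias = [0.0] * D_LLM

# Each output vector has the LLM's embedding width, so the sequence could be
# passed to the LLM in place of token embeddings.
audio_embeddings = conv1d_adapter(encoder_frames, kernel, bias, STRIDE)
print(len(audio_embeddings), len(audio_embeddings[0]))
```

In this sketch the stride shortens the audio sequence, which brings its length closer to that of a text token sequence, while the output width matches the LLM's embedding size. A practical module would be trained on paired data in the seen languages and would likely stack several layers with nonlinearities.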
