Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

Pavel Denisov, Ngoc Thang Vu

TL;DR

This work extends a multilingual LLM to speech by connecting the LLM (BLOOMZ) to a pretrained speech encoder (MMS) through a learnable Adaptor. It introduces a two-stage training regime and a multi-instructional (MI) framework that transfers linguistic knowledge from text to speech. Trained on 1900 hours of transcribed speech from 139 languages, the model shows zero-shot capabilities across ASR, speech translation (SLT), and spoken language understanding. The results show that MI targets improve non-ASR tasks, while combining transcription targets with MI targets (TMI) yields robust cross-task performance, albeit with some limitations tied to the underlying pretrained models. The approach advances cross-modal knowledge transfer and is relevant to low-resource multilingual speech processing and multimodal AI systems; it also identifies regularization and a broader instruction set as areas for further work.

Abstract

Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Overview of the Adaptor training. In stage one, the Adaptor parameters are optimized with the CTC loss to directly predict the transcription (a). In stage two, the Adaptor parameters are optimized with the CE loss applied to the outputs of the LLM; the Adaptor output is enclosed between the prompt's prefix and postfix text and fed to the LLM as input. A prompt can instruct the model to generate a transcription (b) or to perform some other task on the speech input (c). For transcription, the ground truth transcription is used as the training target. For other instructions, the training target is obtained by running LLM inference with the same prompt and the ground truth transcription as input.
  • Figure 2: Comparative evaluation of speech recognition performance between the BLOOMZMMS TMI model and previous works, multi-domain MMS (1B) (Pratap et al., 2023) and Whisper large-v2 (Radford et al., 2023), on the FLEURS-54 evaluation dataset. All numbers are WER except for the Thai, Lao, Burmese, and Khmer languages.
  • Figure 3: Comparative evaluation of speech translation performance between the BLOOMZMMS TMI model and previous works, XLS-R/mBART (Babu et al., 2021) and Whisper large-v2 (Radford et al., 2023), on the CoVoST 2 X→En evaluation set.
  • Figure 4: Cosine similarity between the text and speech embeddings for two FLEURS evaluation utterances. Rows correspond to French (seen by BLOOM) and Finnish (unseen by BLOOM). Columns represent the T, MI, and TMI models.
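
The stage-two target selection described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `StageTwoExample`, `build_example`, and `toy_llm`, the prompt wording, and the plain string concatenation of prefix, transcription, and postfix are all assumptions made for the example. In the actual model the speech side of the input is the Adaptor output rather than text, and the target would come from a real LLM rather than a stand-in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageTwoExample:
    prefix: str   # prompt text placed before the Adaptor output
    postfix: str  # prompt text placed after the Adaptor output
    target: str   # text against which the CE loss would be computed


def build_example(prefix: str, postfix: str, transcription: str,
                  is_transcription: bool,
                  text_llm: Callable[[str], str]) -> StageTwoExample:
    """Pick the training target for one (prompt, utterance) pair.

    Transcription prompts use the ground truth transcription as target.
    Any other instruction gets a synthetic target: the LLM is run on the
    same prompt with the ground truth transcription as its text input.
    """
    if is_transcription:
        target = transcription
    else:
        target = text_llm(prefix + transcription + postfix)
    return StageTwoExample(prefix, postfix, target)


def toy_llm(prompt: str) -> str:
    # Hypothetical stand-in for LLM inference; it only reports prompt length.
    return f"[generated from {len(prompt)} prompt characters]"


# Hypothetical prompts; the paper's actual prompt texts are not shown here.
asr = build_example("Repeat the sentence: ", ". ", "hello world", True, toy_llm)
other = build_example("Translate to French: ", ". ", "hello world", False, toy_llm)
print(asr.target)
print(other.target)
```

The point of the sketch is that only the target differs between cases (b) and (c): the prompt prefix and postfix are reused unchanged, so the same pairs can later wrap the Adaptor output during training.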