Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

Abhishek Gupta, Amruta Parulekar, Sameep Chattopadhyay, Preethi Jyothi

TL;DR

This work investigates how parameter-efficient fine-tuning and text-only adaptation can be effectively combined in a multilingual multimodal model such as SeamlessM4T. Such a model can leverage unlabeled text via text-only adaptation, followed by parameter-efficient ASR fine-tuning, to boost ASR performance.

Abstract

Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation, followed by further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving a relative WER reduction of up to 17% over a baseline in a zero-shot setting without any labeled speech.
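
To make the "relative WER reduction" figure concrete, the following is a minimal sketch of how word error rate and a relative reduction are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the WER values in the usage lines are illustrative placeholders rather than results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by the adapted system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Illustrative placeholder values, not numbers reported in the paper.
baseline = 0.50
adapted = 0.415
print(f"relative reduction: {relative_wer_reduction(baseline, adapted):.2%}")
```

A relative reduction is measured against the baseline's own error rate, so the same 17% relative figure corresponds to a larger absolute drop when the baseline WER is high and a smaller one when it is low.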

Paper Structure

This paper contains 20 sections, 4 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Parameter-efficient Adaptations for SeamlessM4T: A multimodal ASR model such as SeamlessM4T can be fine-tuned in a parameter-efficient manner through either speech-based adaptations or text-only adaptation.
  • Figure 2: SeamlessM4T Length Adapter: Projects speech embedding $X$ to a lower-dimensional representation $\tilde{X}$ in the multimodal space.