MOSA: Mixture of Simple Adapters Outperforms Monolithic Approaches in LLM-based Multilingual ASR

Junjie Li; Jing Peng; Yangui Fang; Shuai Wang; Kai Yu

TL;DR

This paper tackles two problems in LLM-based multilingual ASR: data scarcity and cross-language parameter interference. It introduces MOSA (Mixture of Simple Adapters), an MoE-based projector in which a Router assigns soft weights to multiple lightweight Adapters, enabling a dynamic mixture of language-shared and language-specific representations. The final aligned embedding is $y = \sum_i w_i E_i(x)$ with $w = \mathrm{SoftMax}(G(x))$, where $E_i$ is the $i$-th Adapter and $G$ is the Router. With simple two-layer adapters, a frozen Whisper encoder, and a fixed LLM, MOSA-Base achieves a 15.4% relative reduction in average WER over the Ideal-LLM Base and outperforms it across all eight languages; MOSA still achieves a 13.3% WER reduction when using only 60% of the Ideal-LLM Base's parameters. Ablation studies show that multiple adapters capture language-specific and shared knowledge better than a single one, and that the approach remains robust under severe data imbalance, particularly benefiting low-resource languages. These results suggest that a mixture of simple adapters is more effective than a monolithic, complex projector for LLM-based multilingual ASR, with practical benefits for parameter efficiency and cross-language transfer.
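The mixing rule $y = \sum_i w_i E_i(x)$ can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class names, the toy dimensions, the ReLU activation inside each adapter, the single linear layer used as the Router $G$, and the choice to route one speech frame at a time are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weight, bias, x):
    """Affine map; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero bias (hypothetical initialisation)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class SimpleAdapter:
    """Two-layer adapter E_i: Linear -> ReLU -> Linear (ReLU is an assumption)."""

    def __init__(self, rng, in_dim, hidden_dim, out_dim):
        self.w1, self.b1 = init_linear(rng, in_dim, hidden_dim)
        self.w2, self.b2 = init_linear(rng, hidden_dim, out_dim)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in linear(self.w1, self.b1, x)]
        return linear(self.w2, self.b2, hidden)


class MixtureOfSimpleAdapters:
    """Soft mixture of adapters: y = sum_i w_i * E_i(x), with w = softmax(G(x))."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [
            SimpleAdapter(rng, in_dim, hidden_dim, out_dim) for _ in range(num_adapters)
        ]
        # Router G modelled as one linear layer (assumption).
        self.gate_w, self.gate_b = init_linear(rng, in_dim, num_adapters)

    def route(self, x):
        """Soft weights w over adapters; every adapter gets a non-zero share."""
        return softmax(linear(self.gate_w, self.gate_b, x))

    def __call__(self, x):
        weights = self.route(x)
        outputs = [adapter(x) for adapter in self.adapters]
        out_dim = len(outputs[0])
        return [
            sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(out_dim)
        ]


# Toy usage: one 4-dimensional "encoder frame" projected to a 3-dimensional embedding.
projector = MixtureOfSimpleAdapters(in_dim=4, hidden_dim=8, out_dim=3, num_adapters=4)
frame = [0.5, -1.0, 0.25, 2.0]
weights = projector.route(frame)
embedding = projector(frame)
```

Because the weights come from a softmax rather than a top-k selection, every adapter contributes to every output, which is one plausible way for some adapters to drift toward language-shared features while others specialize.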

Abstract

LLM-based ASR overcomes multilingual data scarcity by projecting speech representations into the LLM space to leverage its robust semantic and reasoning capabilities. However, while previous approaches typically enhance performance by scaling data or model parameters, a single projector often struggles to effectively align representations across different languages. In this work, we propose an MoE-based projector named MOSA (Mixture of Simple Adapters). By aggregating multiple simple adapters, this architecture enables different experts to specialize in learning either language-shared or language-specific knowledge. This approach not only mitigates parameter interference between languages but also facilitates positive transfer from high-resource to low-resource languages, effectively alleviating data scarcity issues. Experimental results demonstrate that MOSA-Base achieves a 15.4% relative reduction in average WER compared to the Ideal-LLM Base, consistently outperforming it across all languages. Notably, MOSA achieves a 13.3% WER reduction over the Ideal-LLM Base while utilizing only 60% of its parameters. These findings highlight MOSA's superior parameter efficiency and robustness against data imbalance, suggesting that a mixture of simple adapters is more suitable for multilingual LLM-based ASR than complex single-adapter designs.
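The relative figures quoted above presumably follow the usual definition of relative WER reduction. This is an assumption, since the abstract does not spell it out. Here $\mathrm{WER}_{\text{base}}$ denotes the average WER of the Ideal-LLM Base and $\mathrm{WER}_{\text{MOSA}}$ that of the MOSA model:

$$\Delta_{\mathrm{rel}} = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{MOSA}}}{\mathrm{WER}_{\text{base}}} \times 100\%$$

Under this reading, a 15.4% relative reduction means the MOSA-Base average WER is about 0.846 times that of the baseline, not 15.4 absolute points lower.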
