MOSA: Mixture of Simple Adapters Outperforms Monolithic Approaches in LLM-based Multilingual ASR

Junjie Li, Jing Peng, Yangui Fang, Shuai Wang, Kai Yu

TL;DR

This paper tackles multilingual ASR by addressing data scarcity and cross-language parameter interference in LLM-based systems. It introduces MOSA (Mixture of Simple Adapters), an MoE-based projector in which a Router assigns soft weights to multiple lightweight Adapters, enabling a dynamic mixture of language-shared and language-specific representations. The final aligned embedding is $y = \sum_i w_i E_i(x)$ with $w = \mathrm{SoftMax}(G(x))$, where $E_i$ is the $i$-th Adapter and $G$ is the Router. With simple two-layer adapters, a frozen Whisper encoder, and a fixed LLM, MOSA-Base achieves a 15.4% relative reduction in average WER over the Ideal-LLM Base and outperforms it on all eight languages; MOSA still achieves a 13.3% WER reduction while using only 60% of the Ideal-LLM Base's parameters. Ablation studies show that multiple adapters better capture language-specific and shared knowledge, and that the approach remains robust under severe data imbalance, particularly benefiting low-resource languages. These results suggest that a mixture of simple adapters is more effective than a monolithic, complex projector for LLM-based multilingual ASR, with practical benefits for parameter efficiency and cross-language transfer.
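The mixture formula $y = \sum_i w_i E_i(x)$, $w = \mathrm{SoftMax}(G(x))$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names, the dimensions, the ReLU activation, the linear form of the Router, and the per-frame routing are all assumptions made for illustration; only the two-layer adapters and the softmax-weighted sum come from the summary above.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class SimpleAdapter:
    """Two-layer adapter E_i. The ReLU in between is an assumption."""

    def __init__(self, d_in, d_hidden, d_out, rng):
        self.w1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.standard_normal((d_hidden, d_out)) * 0.02
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)
        return hidden @ self.w2 + self.b2


class MixtureOfAdapters:
    """Hypothetical MoE projector: y = sum_i w_i * E_i(x), w = softmax(G(x))."""

    def __init__(self, n_adapters, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.adapters = [
            SimpleAdapter(d_in, d_hidden, d_out, rng) for _ in range(n_adapters)
        ]
        # Router G modeled as a single linear map (assumption).
        self.router_w = rng.standard_normal((d_in, n_adapters)) * 0.02

    def route(self, x):
        """Soft weights over adapters for each frame, shape (T, n_adapters)."""
        return softmax(x @ self.router_w, axis=-1)

    def __call__(self, x):
        w = self.route(x)                                          # (T, N)
        outs = np.stack([a(x) for a in self.adapters], axis=-2)    # (T, N, d_out)
        return (w[..., None] * outs).sum(axis=-2)                  # (T, d_out)


# Usage: 5 encoder frames of dimension 16 projected to an 8-dim "LLM" space.
proj = MixtureOfAdapters(n_adapters=4, d_in=16, d_hidden=32, d_out=8)
x = np.random.default_rng(1).standard_normal((5, 16))
y = proj(x)
print(y.shape)
```

Because the weights are soft (every adapter contributes to every frame), an adapter can end up carrying language-shared or language-specific behavior depending on how the Router learns to weight it, which is the effect the summary describes.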

Abstract

LLM-based ASR overcomes multilingual data scarcity by projecting speech representations into the LLM space to leverage its robust semantic and reasoning capabilities. However, while previous approaches typically enhance performance by scaling data or model parameters, a single projector often struggles to effectively align representations across different languages. In this work, we propose an MoE-based projector named MOSA (Mixture of Simple Adapters). By aggregating multiple simple adapters, this architecture enables different experts to specialize in learning either language-shared or language-specific knowledge. This approach not only mitigates parameter interference between languages but also facilitates positive transfer from high-resource to low-resource languages, effectively alleviating data scarcity issues. Experimental results demonstrate that MOSA-Base achieves a 15.4% relative reduction in average WER compared to the Ideal-LLM Base, consistently outperforming it across all languages. Notably, MOSA achieves a 13.3% WER reduction over the Ideal-LLM Base while utilizing only 60% of its parameters. These findings highlight MOSA's superior parameter efficiency and robustness against data imbalance, suggesting that a mixture of simple adapters is more suitable for multilingual LLM-based ASR than complex single-adapter designs.
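The reported improvements are relative WER reductions, that is, the drop in WER divided by the baseline WER. A small sketch of that arithmetic follows; the function name and the two WER values are made up for illustration and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), chosen only to illustrate the formula.
baseline = 10.0
improved = 8.46
print(f"{relative_wer_reduction(baseline, improved):.1%}")
```

Under these made-up numbers, an absolute drop of 1.54 WER points corresponds to a 15.4% relative reduction, which shows why relative and absolute reductions should not be confused when comparing systems.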

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Model Architecture. The speech encoder extracts features from speech. Adapters map these features into the LLM input space, guided by a Router that dynamically weights their outputs. The LLM then performs speech recognition based on the aligned representation and instruction.
  • Figure 2: t-SNE visualization of aligned speech embeddings.
  • Figure 3: Adapter weight distribution across languages.