MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Wenyu Zhang; Shuo Sun; Bin Wang; Xunlong Zou; Zhuohan Liu; Yingxu He; Geyu Lin; Nancy F. Chen; Ai Ti Aw

TL;DR

MoWE introduces a mixture of weak encoders to augment AudioLLMs, combining a strong base encoder with lightweight encoders routed by data-independent and data-dependent mechanisms. The framework concatenates weak-encoder embeddings with base features and fine-tunes a lightweight LLM, optimizing with $L = L_{\text{next-token}} + 0.1 \cdot L_{\text{MoWE}}$ to encourage selective encoder usage. Empirical results across five tasks show improved multitask performance, with both uniform and diverse encoder pools yielding benefits, and competitive results relative to state-of-the-art, even under limited data or model scale. The approach demonstrates encoder specialization, robustness to out-of-distribution data, and practical efficiency, suggesting a viable path to broader, more capable AudioLLMs in real-world applications.
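The routing and loss weighting described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed index used for the data-independent router, and the top-1 selection rule for the data-dependent router are all assumptions made for illustration. Only the 0.1 weight on the auxiliary term comes from the summary above.

```python
import math

LAMBDA_MOWE = 0.1  # weight on the auxiliary MoWE term, as stated in the TL;DR


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, always_on=0):
    """Pick the weak encoders to activate for one audio sample.

    router_scores: per-encoder logits from a data-dependent router
                   (hypothetical input; how they are computed is not shown).
    always_on:     index of the encoder chosen by the data-independent router,
                   assumed here to be the same for every sample.
    Returns the sorted list of active encoder indices and the router probabilities.
    """
    probs = softmax(router_scores)
    # Assumption: the data-dependent router activates its single top-scoring encoder.
    top = max(range(len(probs)), key=probs.__getitem__)
    return sorted({always_on, top}), probs


def total_loss(next_token_loss, mowe_loss):
    """Combine the two terms: L = L_next-token + 0.1 * L_MoWE."""
    return next_token_loss + LAMBDA_MOWE * mowe_loss


if __name__ == "__main__":
    active, probs = route([0.1, 2.0, -1.0], always_on=0)
    print("active encoders:", active)
    print("total loss:", total_loss(2.0, 0.5))
```

In this sketch the embeddings of the active encoders would then be concatenated with the base encoder's features before being passed to the LLM; the concatenation itself is omitted because the summary does not specify the axis.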

Abstract

Rapid advances in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose incorporating mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.

Paper Structure (16 sections, 5 equations, 2 figures, 8 tables)

Introduction
Related Works
AudioLLMs
Mixture of Experts
Proposed Method
AudioLLM Framework
Mixture of Weak Encoders
Experimental Setup
Tasks and Datasets
Implementation
Results and Analysis
Mixture of Uniform Encoders
Mixture of Diverse Encoders
Comparison with Models with Large-Scale Training
Further Analysis
...and 1 more sections

Figures (2)

Figure 1: Overview of the proposed strategy: (a) the AudioLLM framework includes a pool of weak encoders that supplement the strong base encoder on new datasets and tasks; (b) MoWE uses a data-independent router and a data-dependent router to selectively activate encoders for audio processing.
Figure 2: Proportion of samples assigned to each encoder by the data-dependent router. Note that HuBERT-base-ER is activated by the data-independent router for all samples.
