MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw

TL;DR

MoWE introduces a mixture of weak encoders to augment AudioLLMs, combining a strong base encoder with lightweight encoders routed by data-independent and data-dependent mechanisms. The framework concatenates weak-encoder embeddings with base features and fine-tunes a lightweight LLM, optimizing with $L = L_{\text{next-token}} + 0.1 \cdot L_{\text{MoWE}}$ to encourage selective encoder usage. Empirical results across five tasks show improved multitask performance, with both uniform and diverse encoder pools yielding benefits, and competitive results relative to state-of-the-art, even under limited data or model scale. The approach demonstrates encoder specialization, robustness to out-of-distribution data, and practical efficiency, suggesting a viable path to broader, more capable AudioLLMs in real-world applications.

Abstract

The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
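The routing and loss weighting summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`data_dependent_route`, `mowe_features`, `total_loss`), the toy encoders, the softmax top-1 selection, the choice of one encoder per router, and concatenation along the feature dimension are all assumptions made for illustration. The form of $L_{\text{MoWE}}$ is not given in this summary, so it is treated as an opaque input; only the 0.1 weight comes from the text.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def data_dependent_route(router_scores):
    """Pick one weak encoder from input-dependent scores (assumed top-1)."""
    probs = softmax(router_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

def mowe_features(audio, base_encoder, weak_encoders, fixed_idx, router):
    """Concatenate base features with the embeddings of two weak encoders."""
    base = base_encoder(audio)
    # Data-independent routing: the same encoder is used for every sample.
    always_on = weak_encoders[fixed_idx](audio)
    # Data-dependent routing: the encoder is chosen from the input itself.
    chosen, probs = data_dependent_route(router(audio))
    selected = weak_encoders[chosen](audio)
    # List concatenation stands in for concatenation along the feature axis.
    return base + always_on + selected, chosen, probs

def total_loss(next_token_loss, mowe_loss, weight=0.1):
    """L = L_next-token + 0.1 * L_MoWE, with L_MoWE supplied by the caller."""
    return next_token_loss + weight * mowe_loss

# Toy stand-ins (hypothetical): real encoders would be neural networks.
base_encoder = lambda a: [sum(a), len(a)]
weak_encoders = [lambda a: [max(a)], lambda a: [min(a)], lambda a: [a[0]]]
router = lambda a: [0.1, 2.0, -1.0]  # fixed scores, just for the demo

feats, chosen, probs = mowe_features(
    [1.0, 2.0, 3.0], base_encoder, weak_encoders, fixed_idx=0, router=router
)
loss = total_loss(next_token_loss=2.0, mowe_loss=0.5)
```

In this toy run the always-on encoder is index 0 and the data-dependent router selects index 1, so the concatenated feature list holds the base features followed by one embedding from each. A real system would pass these features to the LLM and would compute the routing loss from the router outputs, which this sketch leaves unspecified.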

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Overview of the proposed strategy: (a) the AudioLLM framework includes a pool of weak encoders to supplement the strong base encoder on new datasets and tasks, (b) MoWE uses a data-independent and a data-dependent router to selectively activate encoders for audio processing.
  • Figure 2: Proportion of samples assigned to each encoder by the data-dependent router. Note that HuBERT-base-ER is activated by the data-independent router for all samples.